Discussion:
[dpdk-dev] [RFC 0/5] virtio support for container
Jianfeng Tan
2015-11-05 18:31:11 UTC
This patchset is only a proof of concept (PoC), posted to request comments from the community.

This patchset provides a high-performance networking interface
(virtio) for container-based DPDK applications. How to start DPDK
applications in containers with exclusive ownership of NIC devices
is beyond its scope. The basic idea here is to present a new virtual
device (named eth_cvio) which can be discovered and initialized by
rte_eal_init() in container-based DPDK applications. To minimize the
change, we reuse the already-existing virtio frontend driver code
(drivers/net/virtio/).

Compared to the QEMU/VM case, the virtio device framework (which
translates I/O port read/write operations into the unix socket/cuse
protocol, and is originally provided by QEMU) is integrated into the
virtio frontend driver here. In other words, this new converged
driver plays both the role of the original frontend driver and the
role of QEMU's device framework.
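
Concretely, a register access that the frontend would normally issue
to a PCI ioport is intercepted and turned into a vhost-user message.
Patch 1/5 implements this dispatch in virtio_ioport_write(); one
branch of it, as an excerpt:

	/* excerpt: an ioport write becomes a unix-socket message */
	case VIRTIO_PCI_GUEST_FEATURES:
		hw->guest_features = val;
		guest_features = val;
		vhost_user_sendmsg(hw, VHOST_USER_SET_FEATURES, &guest_features);
		break;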

The biggest difference here lies in how the relative address for the
backend is calculated. The principle of virtio is this: on top of one
or multiple shared memory segments, vhost maintains a reference system
with the base addresses and lengths of these segments, so that when an
address arrives from the VM (usually a GPA, Guest Physical Address),
vhost can translate it into an address it can dereference itself (a
VVA, Vhost Virtual Address). To decrease the overhead of address
translation, we should maintain as few segments as possible. In the
context of virtual machines, GPA space is always locally contiguous,
so it is a good choice. In the container case, the CVA (Container
Virtual Address) can be used instead. This means that: a. when the
base address is set (set_base_addr), the CVA is used; b. when
preparing RX descriptors, CVAs are used; c. when transmitting packets,
CVAs are filled into TX descriptors; d. in TX and CQ headers, CVAs
are used.
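
For illustration, a minimal sketch of the lookup the backend performs
(struct region and addr_to_vva() are hypothetical names, not the
actual vhost code). In the container case, the CVA is registered as
both the "guest physical" base and the userspace address, so the same
lookup effectively hands the address back unchanged:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of backend address translation. */
struct region {
	uint64_t base;       /* GPA (VM case) or CVA (container case) */
	uint64_t len;
	uint8_t *host_vaddr; /* where the backend mmap'ed the segment */
};

static void *
addr_to_vva(const struct region *regs, int n, uint64_t addr)
{
	int i;

	for (i = 0; i < n; i++)
		if (addr >= regs[i].base && addr < regs[i].base + regs[i].len)
			return regs[i].host_vaddr + (addr - regs[i].base);
	return NULL; /* not backed by any shared segment */
}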

How is memory shared? In the VM case, qemu always shares the whole
physical memory layout with the backend. It is not feasible, however,
for a container, as a process, to share all of its virtual memory
regions with the backend. So only specified virtual memory regions
(of shared type) are sent to the backend. This leads to the
limitation that only addresses in these areas can be used to transmit
or receive packets. For now, the shared memory is created in /dev/shm
using shm_open() in the memory initialization process.
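
A minimal sketch of how such a shareable region can be created (patch
4/5 does this in rte_eal_shm_create(); error handling is trimmed here,
and the returned fd is what later crosses the unix socket to the
backend):

#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

static void *
create_shared_region(const char *name, size_t size, int *pfd)
{
	void *vaddr;
	/* backed by tmpfs at /dev/shm; the fd can cross process boundaries */
	int fd = shm_open(name, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);

	if (fd < 0)
		return NULL;
	if (ftruncate(fd, size) < 0) {
		close(fd);
		return NULL;
	}
	vaddr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (vaddr == MAP_FAILED) {
		close(fd);
		return NULL;
	}
	*pfd = fd;
	return vaddr;
}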

How to use?

a. Apply the virtio-for-container patch. We need two copies of the
patched code (referred to as dpdk-app/ and dpdk-vhost/).

b. To compile the container apps:
$: cd dpdk-app
$: vim config/common_linuxapp (uncomment "CONFIG_RTE_VIRTIO_VDEV=y")
$: make config RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
$: make install RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
$: make -C examples/l2fwd RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc

c. To build a docker image, use the Dockerfile below:
$: cat ./Dockerfile
FROM ubuntu:latest
WORKDIR /usr/src/dpdk
COPY . /usr/src/dpdk
CMD ["/usr/src/dpdk/examples/l2fwd/build/l2fwd", "-c", "0xc", "-n", "4", "--no-huge", "--no-pci", "--vdev=eth_cvio0,queue_num=256,rx=1,tx=1,cq=0,path=/var/run/usvhost", "--", "-p", "0x1"]
$: docker build -t dpdk-app-l2fwd .

d. To compile vhost:
$: cd dpdk-vhost
$: make config RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
$: make install RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
$: make -C examples/vhost RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc

e. Start vhost-switch:
$: ./examples/vhost/build/vhost-switch -c 3 -n 4 --socket-mem 1024,1024 -- -p 0x1 --stats 1

f. Start the docker container:
$: docker run -i -t -v <path to vhost unix socket>:/var/run/usvhost dpdk-app-l2fwd

Signed-off-by: Huawei Xie <***@intel.com>
Signed-off-by: Jianfeng Tan <***@intel.com>

Jianfeng Tan (5):
virtio/container: add handler for ioport rd/wr
virtio/container: add a new virtual device named eth_cvio
virtio/container: unify desc->addr assignment
virtio/container: adjust memory initialization process
vhost/container: change mode of vhost listening socket

config/common_linuxapp | 5 +
drivers/net/virtio/Makefile | 4 +
drivers/net/virtio/vhost-user.c | 433 +++++++++++++++++++++++++++
drivers/net/virtio/vhost-user.h | 137 +++++++++
drivers/net/virtio/virtio_ethdev.c | 319 +++++++++++++++-----
drivers/net/virtio/virtio_ethdev.h | 16 +
drivers/net/virtio/virtio_pci.h | 32 +-
drivers/net/virtio/virtio_rxtx.c | 9 +-
drivers/net/virtio/virtio_rxtx_simple.c | 9 +-
drivers/net/virtio/virtqueue.h | 9 +-
lib/librte_eal/common/include/rte_memory.h | 5 +
lib/librte_eal/linuxapp/eal/eal_memory.c | 58 +++-
lib/librte_mempool/rte_mempool.c | 16 +-
lib/librte_vhost/vhost_user/vhost-net-user.c | 5 +
14 files changed, 967 insertions(+), 90 deletions(-)
create mode 100644 drivers/net/virtio/vhost-user.c
create mode 100644 drivers/net/virtio/vhost-user.h
--
2.1.4
Jianfeng Tan
2015-11-05 18:31:12 UTC
Add handlers to turn ioport reads/writes into vhost-user unix socket
messages. Add fields, such as kickfd and callfd, to struct virtio_hw.
Add CONFIG_RTE_VIRTIO_VDEV to control the virtio vdev; it is disabled
by default.

Signed-off-by: Huawei Xie <***@intel.com>
Signed-off-by: Jianfeng Tan <***@intel.com>
---
config/common_linuxapp | 5 +
drivers/net/virtio/Makefile | 4 +
drivers/net/virtio/vhost-user.c | 433 ++++++++++++++++++++++++++++++++++++++++
drivers/net/virtio/vhost-user.h | 137 +++++++++++++
drivers/net/virtio/virtio_pci.h | 32 ++-
5 files changed, 610 insertions(+), 1 deletion(-)
create mode 100644 drivers/net/virtio/vhost-user.c
create mode 100644 drivers/net/virtio/vhost-user.h

diff --git a/config/common_linuxapp b/config/common_linuxapp
index c1d4bbd..99dd348 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -477,3 +477,8 @@ CONFIG_RTE_APP_TEST=y
CONFIG_RTE_TEST_PMD=y
CONFIG_RTE_TEST_PMD_RECORD_CORE_CYCLES=n
CONFIG_RTE_TEST_PMD_RECORD_BURST_STATS=n
+
+#
+# Enable virtio support for container
+#
+#CONFIG_RTE_VIRTIO_VDEV=y
diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index 43835ba..dddf125 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -52,6 +52,10 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c

+ifeq ($(CONFIG_RTE_VIRTIO_VDEV),y)
+ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += vhost-user.c
+endif
+
# this lib depends upon:
DEPDIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += lib/librte_eal lib/librte_ether
DEPDIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += lib/librte_mempool lib/librte_mbuf
diff --git a/drivers/net/virtio/vhost-user.c b/drivers/net/virtio/vhost-user.c
new file mode 100644
index 0000000..d0960ce
--- /dev/null
+++ b/drivers/net/virtio/vhost-user.c
@@ -0,0 +1,433 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+#include <stdint.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <stdio.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <string.h>
+#include <errno.h>
+#include <assert.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <sys/eventfd.h>
+
+#include <rte_mbuf.h>
+
+#include "virtio_pci.h"
+#include "virtio_logs.h"
+#include "virtio_ethdev.h"
+#include "virtqueue.h"
+#include "vhost-user.h"
+
+static int
+vhost_user_read(int fd, void *buf, int len, int *fds, int fd_num)
+{
+ struct msghdr msgh;
+ struct iovec iov;
+ int r;
+
+ size_t fd_size = fd_num * sizeof(int);
+ char control[CMSG_SPACE(fd_size)];
+ struct cmsghdr *cmsg;
+
+ memset(&msgh, 0, sizeof(msgh));
+ memset(control, 0, sizeof(control));
+
+ iov.iov_base = (uint8_t *)buf;
+ iov.iov_len = len;
+
+ msgh.msg_iov = &iov;
+ msgh.msg_iovlen = 1;
+
+ msgh.msg_control = control;
+ msgh.msg_controllen = sizeof(control);
+
+ cmsg = CMSG_FIRSTHDR(&msgh);
+
+ cmsg->cmsg_len = CMSG_LEN(fd_size);
+ cmsg->cmsg_level = SOL_SOCKET;
+ cmsg->cmsg_type = SCM_RIGHTS;
+ memcpy(CMSG_DATA(cmsg), fds, fd_size);
+
+ do {
+ r = sendmsg(fd, &msgh, 0);
+ } while (r < 0 && errno == EINTR);
+
+ return r;
+}
+
+static int
+vhost_user_write(int fd, VhostUserMsg *msg)
+{
+ uint32_t valid_flags = VHOST_USER_REPLY_MASK | VHOST_USER_VERSION;
+ int ret, sz_hdr = VHOST_USER_HDR_SIZE, sz_payload;
+
+
+ ret = recv(fd, (void *)msg, sz_hdr, 0);
+ if (ret < sz_hdr) {
+ PMD_DRV_LOG(ERR, "Failed to recv msg hdr: %d instead of %d.",
+ ret, sz_hdr);
+ goto fail;
+ }
+
+ /* validate msg flags */
+ if (msg->flags != (valid_flags)) {
+ PMD_DRV_LOG(ERR, "Failed to recv msg: flags 0x%x instead of 0x%x.",
+ msg->flags, valid_flags);
+ goto fail;
+ }
+
+ sz_payload = msg->size;
+ if (sz_payload) {
+ ret = recv(fd, (void *)((uint8_t*)msg + sz_hdr), sz_payload, 0);
+ if (ret < sz_payload) {
+ PMD_DRV_LOG(ERR, "Failed to recv msg payload: %d instead of %d.",
+ ret, msg->size);
+ goto fail;
+ }
+ }
+
+ return 0;
+
+fail:
+ return -1;
+}
+
+static VhostUserMsg m __attribute__ ((unused));
+static int
+vhost_user_sendmsg(struct virtio_hw *hw, VhostUserRequest req, void *arg)
+{
+ VhostUserMsg msg;
+ VhostUserMemoryRegion *mr;
+ struct vhost_vring_file *file = 0;
+ int need_reply = 0;
+ int fds[VHOST_MEMORY_MAX_NREGIONS];
+ ssize_t fd_num = 0;
+ int len;
+
+ msg.request = req;
+ msg.flags = VHOST_USER_VERSION;
+ msg.size = 0;
+
+ switch (req) {
+ case VHOST_USER_GET_FEATURES:
+ need_reply = 1;
+ break;
+
+ case VHOST_USER_SET_FEATURES:
+ case VHOST_USER_SET_LOG_BASE:
+ msg.payload.u64 = *((__u64 *)arg);
+ msg.size = sizeof(m.payload.u64);
+ break;
+
+ case VHOST_USER_SET_MEM_TABLE:
+ {
+ int fd;
+ void *addr;
+ uint64_t size;
+
+ rte_memseg_info_get(0, &fd, &size, &addr);
+
+ mr = &msg.payload.memory.regions[0];
+ mr->userspace_addr = (uint64_t)addr;
+ mr->memory_size = size;
+ /* to keep continuity, use virtual address here */
+ mr->guest_phys_addr = (uint64_t)addr;
+ mr->mmap_offset = 0;
+ fds[fd_num++] = fd;
+ msg.payload.memory.nregions = 1;
+
+ msg.size = sizeof(m.payload.memory.nregions);
+ msg.size += sizeof(m.payload.memory.padding);
+ msg.size += fd_num * sizeof(VhostUserMemoryRegion);
+
+ break;
+ }
+ case VHOST_USER_SET_LOG_FD:
+ fds[fd_num++] = *((int *)arg);
+ break;
+
+ case VHOST_USER_SET_VRING_NUM:
+ case VHOST_USER_SET_VRING_BASE:
+ memcpy(&msg.payload.state, arg, sizeof(struct vhost_vring_state));
+ msg.size = sizeof(m.payload.state);
+ break;
+
+ case VHOST_USER_GET_VRING_BASE:
+ memcpy(&msg.payload.state, arg, sizeof(struct vhost_vring_state));
+ msg.size = sizeof(m.payload.state);
+ need_reply = 1;
+ break;
+
+ case VHOST_USER_SET_VRING_ADDR:
+ memcpy(&msg.payload.addr, arg, sizeof(struct vhost_vring_addr));
+ msg.size = sizeof(m.payload.addr);
+ break;
+
+ case VHOST_USER_SET_VRING_KICK:
+ case VHOST_USER_SET_VRING_CALL:
+ case VHOST_USER_SET_VRING_ERR:
+ file = arg;
+ msg.payload.u64 = file->index & VHOST_USER_VRING_IDX_MASK;
+ msg.size = sizeof(m.payload.u64);
+ if (file->fd > 0)
+ fds[fd_num++] = file->fd;
+ else
+ msg.payload.u64 |= VHOST_USER_VRING_NOFD_MASK;
+ break;
+ default:
+ PMD_DRV_LOG(ERR, "vhost-user trying to send unhandled ioctl");
+ return -1;
+ }
+
+ len = VHOST_USER_HDR_SIZE + msg.size;
+ if (vhost_user_read(hw->sockfd, &msg, len, fds, fd_num) < 0)
+ return 0;
+
+ if (need_reply) {
+ if (vhost_user_write(hw->sockfd, &msg) < 0)
+ return -1;
+
+ if (req != msg.request) {
+ PMD_DRV_LOG(ERR, "Received unexpected msg type."
+ " Expected %d received %d",
+ req, msg.request);
+ return -1;
+ }
+
+ switch (req) {
+ case VHOST_USER_GET_FEATURES:
+ if (msg.size != sizeof(m.payload.u64)) {
+ PMD_DRV_LOG(ERR, "Received bad msg size.");
+ return -1;
+ }
+ *((__u64 *)arg) = msg.payload.u64;
+ break;
+ case VHOST_USER_GET_VRING_BASE:
+ if (msg.size != sizeof(m.payload.state)) {
+ PMD_DRV_LOG(ERR, "Received bad msg size.");
+ return -1;
+ }
+ memcpy(arg, &msg.payload.state, sizeof(struct vhost_vring_state));
+ break;
+ default:
+ PMD_DRV_LOG(ERR, "Received unexpected msg type.");
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
+static void
+kick_one_vq(struct virtio_hw *hw, struct virtqueue *vq, unsigned queue_sel)
+{
+ struct vhost_vring_file file;
+
+ /* Of all per virtqueue MSGs, make sure VHOST_SET_VRING_CALL comes
+ * firstly because vhost depends on this msg to allocate virtqueue
+ * pair.
+ */
+ file.index = queue_sel;
+ file.fd = hw->callfd;
+ vhost_user_sendmsg(hw, VHOST_USER_SET_VRING_CALL, &file);
+
+ struct vhost_vring_state state;
+ state.index = queue_sel;
+ state.num = vq->vq_ring.num;
+ vhost_user_sendmsg(hw, VHOST_USER_SET_VRING_NUM, &state);
+
+ state.num = 0; /* no reservation */
+ vhost_user_sendmsg(hw, VHOST_USER_SET_VRING_BASE, &state);
+
+ struct vhost_vring_addr addr = {
+ .index = queue_sel,
+ .desc_user_addr = (uint64_t)vq->vq_ring.desc,
+ .avail_user_addr = (uint64_t)vq->vq_ring.avail,
+ .used_user_addr = (uint64_t)vq->vq_ring.used,
+ .log_guest_addr = 0,
+ .flags = 0, /* disable log */
+ };
+ vhost_user_sendmsg(hw, VHOST_USER_SET_VRING_ADDR, &addr);
+
+ /* Of all per virtqueue MSGs, make sure VHOST_SET_VRING_KICK comes
+ * lastly because vhost depends on this msg to judge if
+ * virtio_is_ready().
+ */
+ file.fd = hw->kickfd;
+ vhost_user_sendmsg(hw, VHOST_USER_SET_VRING_KICK, &file);
+}
+
+static void kick_all_vq(struct virtio_hw *hw)
+{
+ unsigned i, queue_sel;
+ struct rte_eth_dev_data *data = hw->data;
+
+ vhost_user_sendmsg(hw, VHOST_USER_SET_MEM_TABLE, NULL);
+
+ for (i = 0; i < data->nb_rx_queues; ++i) {
+ queue_sel = 2 * i + VTNET_SQ_RQ_QUEUE_IDX;
+ kick_one_vq(hw, data->rx_queues[i], queue_sel);
+ }
+ for (i = 0; i < data->nb_tx_queues; ++i) {
+ queue_sel = 2 * i + VTNET_SQ_TQ_QUEUE_IDX;
+ kick_one_vq(hw, data->tx_queues[i], queue_sel);
+ }
+}
+
+void
+virtio_ioport_write(struct virtio_hw *hw, uint64_t addr, uint32_t val)
+{
+ uint64_t guest_features;
+
+ switch (addr) {
+ case VIRTIO_PCI_GUEST_FEATURES:
+ hw->guest_features = val;
+ guest_features = val;
+ vhost_user_sendmsg(hw, VHOST_USER_SET_FEATURES, &guest_features);
+ break;
+ case VIRTIO_PCI_QUEUE_PFN:
+ /* do nothing */
+ break;
+ case VIRTIO_PCI_QUEUE_SEL:
+ hw->queue_sel = val;
+ break;
+ case VIRTIO_PCI_STATUS:
+ if (val & VIRTIO_CONFIG_S_DRIVER_OK)
+ kick_all_vq(hw);
+ hw->status = val & 0xFF;
+ break;
+ default:
+ PMD_DRV_LOG(ERR, "unexpected address %"PRIu64" value 0x%x\n",
+ addr, val);
+ break;
+ }
+}
+
+uint32_t
+virtio_ioport_read(struct virtio_hw *hw, uint64_t addr)
+{
+ uint32_t ret = 0xFFFFFFFF;
+ uint64_t host_features;
+
+ PMD_DRV_LOG(INFO, "addr: %"PRIu64"\n", addr);
+
+ switch (addr) {
+ case VIRTIO_PCI_HOST_FEATURES:
+ vhost_user_sendmsg(hw, VHOST_USER_GET_FEATURES, &host_features);
+ ret = host_features;
+ break;
+ case VIRTIO_PCI_GUEST_FEATURES:
+ ret = hw->guest_features;
+ break;
+ case VIRTIO_PCI_QUEUE_PFN:
+ PMD_DRV_LOG(ERR, "VIRTIO_PCI_QUEUE_PFN (r) not supported\n");
+ break;
+ case VIRTIO_PCI_QUEUE_NUM:
+ ret = hw->queue_num;
+ break;
+ case VIRTIO_PCI_QUEUE_SEL:
+ ret = hw->queue_sel;
+ break;
+ case VIRTIO_PCI_STATUS:
+ ret = hw->status;
+ break;
+ default:
+ PMD_DRV_LOG(ERR, "%"PRIu64" (r) not supported\n", addr);
+ break;
+ }
+
+ return ret;
+}
+
+int
+virtio_vdev_init(struct rte_eth_dev_data *data, const char *path,
+ int nb_rx, int nb_tx, int nb_cq __attribute__ ((unused)),
+ int queue_num)
+{
+ int flag;
+ int sockfd, callfd, kickfd;
+ struct sockaddr_un un;
+ struct virtio_hw *hw = data->dev_private;
+
+ hw->data = data;
+ hw->path = strdup(path);
+ hw->max_rx_queues = nb_rx;
+ hw->max_tx_queues = nb_tx;
+ hw->queue_num = queue_num;
+
+ /* TODO: cq */
+
+ sockfd = socket(AF_UNIX, SOCK_STREAM, 0);
+
+ if (sockfd < 0) {
+ PMD_DRV_LOG(ERR, "socket error, %s\n", strerror(errno));
+ exit(-1);
+ }
+ flag = fcntl(sockfd, F_GETFD);
+ fcntl(sockfd, F_SETFD, flag | FD_CLOEXEC);
+
+ memset(&un, 0, sizeof(un));
+ un.sun_family = AF_UNIX;
+ snprintf(un.sun_path, sizeof(un.sun_path), "%s", hw->path);
+ if (connect(sockfd, (struct sockaddr *)&un, sizeof(un)) < 0) {
+ PMD_DRV_LOG(ERR, "connect error, %s\n", strerror(errno));
+ exit(-1);
+ }
+ hw->sockfd = sockfd;
+
+ /* or use invalid flag to disable it, but vhost-dpdk uses this to judge
+ * if dev is alive. so finally we need two real event_fds.
+ */
+ callfd = eventfd(0, O_CLOEXEC | O_NONBLOCK);
+ if (callfd < 0) {
+ PMD_DRV_LOG(ERR, "callfd error, %s\n", strerror(errno));
+ exit(-1);
+ }
+ hw->callfd = callfd;
+
+ kickfd = eventfd(0, O_CLOEXEC | O_NONBLOCK);
+ if (kickfd < 0) {
+ PMD_DRV_LOG(ERR, "callfd error, %s\n", strerror(errno));
+ exit(-1);
+ }
+ hw->kickfd = kickfd;
+
+ /* VHOST_USER_SET_OWNER */
+ vhost_user_sendmsg(hw, VHOST_USER_SET_OWNER, NULL);
+
+ return 0;
+}
diff --git a/drivers/net/virtio/vhost-user.h b/drivers/net/virtio/vhost-user.h
new file mode 100644
index 0000000..148e2b7
--- /dev/null
+++ b/drivers/net/virtio/vhost-user.h
@@ -0,0 +1,137 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _VHOST_NET_USER_H
+#define _VHOST_NET_USER_H
+
+#include <stdint.h>
+#include <linux/types.h>
+
+#define VHOST_MEMORY_MAX_NREGIONS 8
+
+struct vhost_vring_state {
+ unsigned int index;
+ unsigned int num;
+};
+
+struct vhost_vring_file {
+ unsigned int index;
+ int fd;
+};
+
+struct vhost_vring_addr {
+ unsigned int index;
+ /* Option flags. */
+ unsigned int flags;
+ /* Flag values: */
+ /* Whether log address is valid. If set enables logging. */
+#define VHOST_VRING_F_LOG 0
+
+ /* Start of array of descriptors (virtually contiguous) */
+ __u64 desc_user_addr;
+ /* Used structure address. Must be 32 bit aligned */
+ __u64 used_user_addr;
+ /* Available structure address. Must be 16 bit aligned */
+ __u64 avail_user_addr;
+ /* Logging support. */
+ /* Log writes to used structure, at offset calculated from specified
+ * address. Address must be 32 bit aligned. */
+ __u64 log_guest_addr;
+};
+
+#define VIRTIO_CONFIG_S_DRIVER_OK 4
+
+/* refer to hw/virtio/vhost-user.c */
+
+typedef enum VhostUserRequest {
+ VHOST_USER_NONE = 0,
+ VHOST_USER_GET_FEATURES = 1,
+ VHOST_USER_SET_FEATURES = 2,
+ VHOST_USER_SET_OWNER = 3,
+ VHOST_USER_RESET_OWNER = 4,
+ VHOST_USER_SET_MEM_TABLE = 5,
+ VHOST_USER_SET_LOG_BASE = 6,
+ VHOST_USER_SET_LOG_FD = 7,
+ VHOST_USER_SET_VRING_NUM = 8,
+ VHOST_USER_SET_VRING_ADDR = 9,
+ VHOST_USER_SET_VRING_BASE = 10,
+ VHOST_USER_GET_VRING_BASE = 11,
+ VHOST_USER_SET_VRING_KICK = 12,
+ VHOST_USER_SET_VRING_CALL = 13,
+ VHOST_USER_SET_VRING_ERR = 14,
+ VHOST_USER_GET_PROTOCOL_FEATURES = 15,
+ VHOST_USER_SET_PROTOCOL_FEATURES = 16,
+ VHOST_USER_GET_QUEUE_NUM = 17,
+ VHOST_USER_SET_VRING_ENABLE = 18,
+ VHOST_USER_MAX
+} VhostUserRequest;
+
+typedef struct VhostUserMemoryRegion {
+ uint64_t guest_phys_addr;
+ uint64_t memory_size;
+ uint64_t userspace_addr;
+ uint64_t mmap_offset;
+} VhostUserMemoryRegion;
+
+typedef struct VhostUserMemory {
+ uint32_t nregions;
+ uint32_t padding;
+ VhostUserMemoryRegion regions[VHOST_MEMORY_MAX_NREGIONS];
+} VhostUserMemory;
+
+typedef struct VhostUserMsg {
+ VhostUserRequest request;
+
+#define VHOST_USER_VERSION_MASK 0x3
+#define VHOST_USER_REPLY_MASK (0x1 << 2)
+ uint32_t flags;
+ uint32_t size; /* the following payload size */
+ union {
+#define VHOST_USER_VRING_IDX_MASK 0xff
+#define VHOST_USER_VRING_NOFD_MASK (0x1<<8)
+ uint64_t u64;
+ struct vhost_vring_state state;
+ struct vhost_vring_addr addr;
+ VhostUserMemory memory;
+ } payload;
+ int fds[VHOST_MEMORY_MAX_NREGIONS];
+} __attribute((packed)) VhostUserMsg;
+
+#define VHOST_USER_HDR_SIZE offsetof(VhostUserMsg, payload.u64)
+#define VHOST_USER_PAYLOAD_SIZE (sizeof(VhostUserMsg) - VHOST_USER_HDR_SIZE)
+
+/* The version of the protocol we support */
+#define VHOST_USER_VERSION 0x1
+
+/*****************************************************************************/
+#endif
diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
index 47f722a..e33579f 100644
--- a/drivers/net/virtio/virtio_pci.h
+++ b/drivers/net/virtio/virtio_pci.h
@@ -147,7 +147,6 @@ struct virtqueue;
* rest are per-device feature bits.
*/
#define VIRTIO_TRANSPORT_F_START 28
-#define VIRTIO_TRANSPORT_F_END 32

/* The Guest publishes the used index for which it expects an interrupt
* at the end of the avail ring. Host should ignore the avail->flags field. */
@@ -174,6 +173,16 @@ struct virtio_hw {
uint8_t use_msix;
uint8_t started;
uint8_t mac_addr[ETHER_ADDR_LEN];
+#ifdef RTE_VIRTIO_VDEV
+ uint32_t queue_num;
+ struct rte_eth_dev_data *data;
+ char *path;
+ int sockfd;
+ int callfd;
+ int kickfd;
+ uint32_t queue_sel;
+ uint8_t status;
+#endif
};

/*
@@ -226,6 +235,25 @@ outl_p(unsigned int data, unsigned int port)
}
#endif

+#ifdef RTE_VIRTIO_VDEV
+uint32_t virtio_ioport_read(struct virtio_hw *, uint64_t);
+void virtio_ioport_write(struct virtio_hw *, uint64_t, uint32_t);
+
+#define VIRTIO_READ_REG_1(hw, reg) \
+ virtio_ioport_read(hw, reg)
+#define VIRTIO_WRITE_REG_1(hw, reg, value) \
+ virtio_ioport_write(hw, reg, value)
+#define VIRTIO_READ_REG_2(hw, reg) \
+ virtio_ioport_read(hw, reg)
+#define VIRTIO_WRITE_REG_2(hw, reg, value) \
+ virtio_ioport_write(hw, reg, value)
+#define VIRTIO_READ_REG_4(hw, reg) \
+ virtio_ioport_read(hw, reg)
+#define VIRTIO_WRITE_REG_4(hw, reg, value) \
+ virtio_ioport_write(hw, reg, value)
+
+#else /* RTE_VIRTIO_VDEV */
+
#define VIRTIO_PCI_REG_ADDR(hw, reg) \
(unsigned short)((hw)->io_base + (reg))

@@ -244,6 +272,8 @@ outl_p(unsigned int data, unsigned int port)
#define VIRTIO_WRITE_REG_4(hw, reg, value) \
outl_p((unsigned int)(value), (VIRTIO_PCI_REG_ADDR((hw), (reg))))

+#endif /* RTE_VIRTIO_VDEV */
+
static inline int
vtpci_with_feature(struct virtio_hw *hw, uint32_t bit)
{
--
2.1.4
Jianfeng Tan
2015-11-05 18:31:13 UTC
Add a new virtual device named eth_cvio; it can be used just like
eth_ring, eth_null, etc. Configurable parameters include the number
of rx, tx, and cq queues, the path of the vhost unix socket, and the
queue size. The major difference from virtio for VMs is that here we
use virtual addresses instead of physical addresses for vhost to
calculate relative addresses.
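
For example, with the setup from the cover letter, the device is
instantiated from the EAL command line like this:

$: ./l2fwd -c 0xc -n 4 --no-huge --no-pci \
   --vdev=eth_cvio0,queue_num=256,rx=1,tx=1,cq=0,path=/var/run/usvhost \
   -- -p 0x1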

Signed-off-by: Huawei Xie <***@intel.com>
Signed-off-by: Jianfeng Tan <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.c | 319 +++++++++++++++++++++++++++++--------
drivers/net/virtio/virtio_ethdev.h | 16 ++
drivers/net/virtio/virtqueue.h | 9 +-
3 files changed, 275 insertions(+), 69 deletions(-)

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 6001108..b5e2126 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -56,6 +56,7 @@
#include <rte_memory.h>
#include <rte_eal.h>
#include <rte_dev.h>
+#include <rte_kvargs.h>

#include "virtio_ethdev.h"
#include "virtio_pci.h"
@@ -63,7 +64,6 @@
#include "virtqueue.h"
#include "virtio_rxtx.h"

-
static int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
static int eth_virtio_dev_uninit(struct rte_eth_dev *eth_dev);
static int virtio_dev_configure(struct rte_eth_dev *dev);
@@ -164,8 +164,7 @@ virtio_send_command(struct virtqueue *vq, struct virtio_pmd_ctrl *ctrl,
if ((vq->vq_free_cnt < ((uint32_t)pkt_num + 2)) || (pkt_num < 1))
return -1;

- memcpy(vq->virtio_net_hdr_mz->addr, ctrl,
- sizeof(struct virtio_pmd_ctrl));
+ memcpy(vq->virtio_net_hdr_vaddr, ctrl, sizeof(struct virtio_pmd_ctrl));

/*
* Format is enforced in qemu code:
@@ -174,14 +173,14 @@ virtio_send_command(struct virtqueue *vq, struct virtio_pmd_ctrl *ctrl,
* One RX packet for ACK.
*/
vq->vq_ring.desc[head].flags = VRING_DESC_F_NEXT;
- vq->vq_ring.desc[head].addr = vq->virtio_net_hdr_mz->phys_addr;
+ vq->vq_ring.desc[head].addr = vq->virtio_net_hdr_mem;
vq->vq_ring.desc[head].len = sizeof(struct virtio_net_ctrl_hdr);
vq->vq_free_cnt--;
i = vq->vq_ring.desc[head].next;

for (k = 0; k < pkt_num; k++) {
vq->vq_ring.desc[i].flags = VRING_DESC_F_NEXT;
- vq->vq_ring.desc[i].addr = vq->virtio_net_hdr_mz->phys_addr
+ vq->vq_ring.desc[i].addr = vq->virtio_net_hdr_mem
+ sizeof(struct virtio_net_ctrl_hdr)
+ sizeof(ctrl->status) + sizeof(uint8_t)*sum;
vq->vq_ring.desc[i].len = dlen[k];
@@ -191,7 +190,7 @@ virtio_send_command(struct virtqueue *vq, struct virtio_pmd_ctrl *ctrl,
}

vq->vq_ring.desc[i].flags = VRING_DESC_F_WRITE;
- vq->vq_ring.desc[i].addr = vq->virtio_net_hdr_mz->phys_addr
+ vq->vq_ring.desc[i].addr = vq->virtio_net_hdr_mem
+ sizeof(struct virtio_net_ctrl_hdr);
vq->vq_ring.desc[i].len = sizeof(ctrl->status);
vq->vq_free_cnt--;
@@ -236,7 +235,7 @@ virtio_send_command(struct virtqueue *vq, struct virtio_pmd_ctrl *ctrl,
PMD_INIT_LOG(DEBUG, "vq->vq_free_cnt=%d\nvq->vq_desc_head_idx=%d",
vq->vq_free_cnt, vq->vq_desc_head_idx);

- memcpy(&result, vq->virtio_net_hdr_mz->addr,
+ memcpy(&result, vq->virtio_net_hdr_vaddr,
sizeof(struct virtio_pmd_ctrl));

return result.status;
@@ -374,66 +373,79 @@ int virtio_dev_queue_setup(struct rte_eth_dev *dev,
}
}

- /*
- * Virtio PCI device VIRTIO_PCI_QUEUE_PF register is 32bit,
- * and only accepts 32 bit page frame number.
- * Check if the allocated physical memory exceeds 16TB.
- */
- if ((mz->phys_addr + vq->vq_ring_size - 1) >> (VIRTIO_PCI_QUEUE_ADDR_SHIFT + 32)) {
- PMD_INIT_LOG(ERR, "vring address shouldn't be above 16TB!");
- rte_free(vq);
- return -ENOMEM;
- }
-
memset(mz->addr, 0, sizeof(mz->len));
vq->mz = mz;
- vq->vq_ring_mem = mz->phys_addr;
vq->vq_ring_virt_mem = mz->addr;
- PMD_INIT_LOG(DEBUG, "vq->vq_ring_mem: 0x%"PRIx64, (uint64_t)mz->phys_addr);
- PMD_INIT_LOG(DEBUG, "vq->vq_ring_virt_mem: 0x%"PRIx64, (uint64_t)(uintptr_t)mz->addr);
+
+ if (dev->dev_type == RTE_ETH_DEV_PCI) {
+ vq->vq_ring_mem = mz->phys_addr;
+
+ /* Virtio PCI device VIRTIO_PCI_QUEUE_PF register is 32bit,
+ * and only accepts 32 bit page frame number.
+ * Check if the allocated physical memory exceeds 16TB.
+ */
+ uint64_t last_physaddr = vq->vq_ring_mem + vq->vq_ring_size - 1;
+ if (last_physaddr >> (VIRTIO_PCI_QUEUE_ADDR_SHIFT + 32)) {
+ PMD_INIT_LOG(ERR, "vring address shouldn't be above 16TB!");
+ rte_free(vq);
+ return -ENOMEM;
+ }
+ }
+#ifdef RTE_VIRTIO_VDEV
+ else { /* RTE_ETH_DEV_VIRTUAL */
+ /* Use virtual addr to fill!!! */
+ vq->vq_ring_mem = (phys_addr_t)mz->addr;
+
+ /* TODO: check last_physaddr */
+ }
+#endif
+
+ PMD_INIT_LOG(DEBUG, "vq->vq_ring_mem: 0x%"PRIx64,
+ (uint64_t)vq->vq_ring_mem);
+ PMD_INIT_LOG(DEBUG, "vq->vq_ring_virt_mem: 0x%"PRIx64,
+ (uint64_t)(uintptr_t)vq->vq_ring_virt_mem);
vq->virtio_net_hdr_mz = NULL;
vq->virtio_net_hdr_mem = 0;

+ uint64_t hdr_size = 0;
if (queue_type == VTNET_TQ) {
/*
* For each xmit packet, allocate a virtio_net_hdr
*/
snprintf(vq_name, sizeof(vq_name), "port%d_tvq%d_hdrzone",
- dev->data->port_id, queue_idx);
- vq->virtio_net_hdr_mz = rte_memzone_reserve_aligned(vq_name,
- vq_size * hw->vtnet_hdr_size,
- socket_id, 0, RTE_CACHE_LINE_SIZE);
- if (vq->virtio_net_hdr_mz == NULL) {
- if (rte_errno == EEXIST)
- vq->virtio_net_hdr_mz =
- rte_memzone_lookup(vq_name);
- if (vq->virtio_net_hdr_mz == NULL) {
- rte_free(vq);
- return -ENOMEM;
- }
- }
- vq->virtio_net_hdr_mem =
- vq->virtio_net_hdr_mz->phys_addr;
- memset(vq->virtio_net_hdr_mz->addr, 0,
- vq_size * hw->vtnet_hdr_size);
+ dev->data->port_id, queue_idx);
+ hdr_size = vq_size * hw->vtnet_hdr_size;
} else if (queue_type == VTNET_CQ) {
- /* Allocate a page for control vq command, data and status */
snprintf(vq_name, sizeof(vq_name), "port%d_cvq_hdrzone",
- dev->data->port_id);
- vq->virtio_net_hdr_mz = rte_memzone_reserve_aligned(vq_name,
- PAGE_SIZE, socket_id, 0, RTE_CACHE_LINE_SIZE);
- if (vq->virtio_net_hdr_mz == NULL) {
+ dev->data->port_id);
+ /* Allocate a page for control vq command, data and status */
+ hdr_size = PAGE_SIZE;
+ }
+
+ if (hdr_size) { /* queue_type is VTNET_TQ or VTNET_CQ */
+ mz = rte_memzone_reserve_aligned(vq_name,
+ hdr_size, socket_id, 0, RTE_CACHE_LINE_SIZE);
+ if (mz == NULL) {
if (rte_errno == EEXIST)
- vq->virtio_net_hdr_mz =
- rte_memzone_lookup(vq_name);
- if (vq->virtio_net_hdr_mz == NULL) {
+ mz = rte_memzone_lookup(vq_name);
+ if (mz == NULL) {
rte_free(vq);
return -ENOMEM;
}
}
- vq->virtio_net_hdr_mem =
- vq->virtio_net_hdr_mz->phys_addr;
- memset(vq->virtio_net_hdr_mz->addr, 0, PAGE_SIZE);
+ vq->virtio_net_hdr_mz = mz;
+ vq->virtio_net_hdr_vaddr = mz->addr;
+ memset(vq->virtio_net_hdr_vaddr, 0, hdr_size);
+
+ if (dev->dev_type == RTE_ETH_DEV_PCI) {
+ vq->virtio_net_hdr_mem = mz->phys_addr;
+ }
+#ifdef RTE_VIRTIO_VDEV
+ else {
+ /* Use vaddr!!! */
+ vq->virtio_net_hdr_mem = (phys_addr_t)mz->addr;
+ }
+#endif
}

/*
@@ -491,8 +503,10 @@ virtio_dev_close(struct rte_eth_dev *dev)
PMD_INIT_LOG(DEBUG, "virtio_dev_close");

/* reset the NIC */
- if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
- vtpci_irq_config(hw, VIRTIO_MSI_NO_VECTOR);
+ if (dev->dev_type == RTE_ETH_DEV_PCI) {
+ if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
+ vtpci_irq_config(hw, VIRTIO_MSI_NO_VECTOR);
+ }
vtpci_reset(hw);
hw->started = 0;
virtio_dev_free_mbufs(dev);
@@ -1288,11 +1302,18 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
}

pci_dev = eth_dev->pci_dev;
- if (virtio_resource_init(pci_dev) < 0)
- return -1;
-
- hw->use_msix = virtio_has_msix(&pci_dev->addr);
- hw->io_base = (uint32_t)(uintptr_t)pci_dev->mem_resource[0].addr;
+ if (eth_dev->dev_type == RTE_ETH_DEV_PCI) {
+ if (virtio_resource_init(pci_dev) < 0)
+ return -1;
+ hw->use_msix = virtio_has_msix(&pci_dev->addr);
+ hw->io_base = (uint32_t)(uintptr_t)pci_dev->mem_resource[0].addr;
+ }
+#ifdef RTE_VIRTIO_VDEV
+ else {
+ hw->use_msix = 0;
+ hw->io_base = 0;
+ }
+#endif

/* Reset the device although not necessary at startup */
vtpci_reset(hw);
@@ -1305,8 +1326,10 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
virtio_negotiate_features(hw);

/* If host does not support status then disable LSC */
- if (!vtpci_with_feature(hw, VIRTIO_NET_F_STATUS))
- pci_dev->driver->drv_flags &= ~RTE_PCI_DRV_INTR_LSC;
+ if (eth_dev->dev_type == RTE_ETH_DEV_PCI) {
+ if (!vtpci_with_feature(hw, VIRTIO_NET_F_STATUS))
+ pci_dev->driver->drv_flags &= ~RTE_PCI_DRV_INTR_LSC;
+ }

rx_func_get(eth_dev);

@@ -1385,12 +1408,12 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
PMD_INIT_LOG(DEBUG, "port %d vendorID=0x%x deviceID=0x%x",
eth_dev->data->port_id, pci_dev->id.vendor_id,
pci_dev->id.device_id);
-
- /* Setup interrupt callback */
- if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
- rte_intr_callback_register(&pci_dev->intr_handle,
- virtio_interrupt_handler, eth_dev);
-
+ /* Setup interrupt callback */
+ if (eth_dev->dev_type == RTE_ETH_DEV_PCI) {
+ if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
+ rte_intr_callback_register(&pci_dev->intr_handle,
+ virtio_interrupt_handler, eth_dev);
+ }
virtio_dev_cq_start(eth_dev);

return 0;
@@ -1423,10 +1446,12 @@ eth_virtio_dev_uninit(struct rte_eth_dev *eth_dev)
eth_dev->data->mac_addrs = NULL;

/* reset interrupt callback */
- if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
- rte_intr_callback_unregister(&pci_dev->intr_handle,
- virtio_interrupt_handler,
- eth_dev);
+ if (eth_dev->dev_type == RTE_ETH_DEV_PCI) {
+ if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
+ rte_intr_callback_unregister(&pci_dev->intr_handle,
+ virtio_interrupt_handler,
+ eth_dev);
+ }

PMD_INIT_LOG(DEBUG, "dev_uninit completed");

@@ -1481,6 +1506,11 @@ virtio_dev_configure(struct rte_eth_dev *dev)
return (-EINVAL);
}

+#ifdef RTE_VIRTIO_VDEV
+ if (dev->dev_type == RTE_ETH_DEV_VIRTUAL)
+ return 0;
+#endif
+
hw->vlan_strip = rxmode->hw_vlan_strip;

if (rxmode->hw_vlan_filter
@@ -1688,3 +1718,156 @@ static struct rte_driver rte_virtio_driver = {
};

PMD_REGISTER_DRIVER(rte_virtio_driver);
+
+#ifdef RTE_VIRTIO_VDEV
+
+#define ETH_CVIO_ARG_RX_NUM "rx"
+#define ETH_CVIO_ARG_TX_NUM "tx"
+#define ETH_CVIO_ARG_CQ_NUM "cq"
+#define ETH_CVIO_ARG_SK_PATH "path"
+#define ETH_CVIO_ARG_QUEUE_SIZE "queue_num"
+/*TODO: specify mac addr */
+static const char *valid_args[] = {
+ ETH_CVIO_ARG_RX_NUM,
+ ETH_CVIO_ARG_TX_NUM,
+ ETH_CVIO_ARG_CQ_NUM,
+ ETH_CVIO_ARG_SK_PATH,
+ ETH_CVIO_ARG_QUEUE_SIZE,
+ NULL
+};
+
+static int
+get_string_arg(const char *key __rte_unused,
+ const char *value, void *extra_args)
+{
+ if ((value == NULL) || (extra_args == NULL))
+ return -EINVAL;
+
+ strcpy(extra_args, value);
+
+ return 0;
+}
+
+static int
+get_integer_arg(const char *key __rte_unused,
+ const char *value, void *extra_args)
+{
+ uint64_t *p_u64 = extra_args;
+
+ if ((value == NULL) || (extra_args == NULL))
+ return -EINVAL;
+
+ *p_u64 = (uint64_t)strtoull(value, NULL, 0);
+
+ return 0;
+}
+
+static struct rte_eth_dev *
+cvio_eth_dev_alloc(const char *name)
+{
+ struct rte_eth_dev *eth_dev;
+ struct rte_eth_dev_data *data;
+ struct rte_pci_device *pci_dev;
+ struct virtio_hw *hw;
+
+ eth_dev = rte_eth_dev_allocate(name, RTE_ETH_DEV_VIRTUAL);
+ if (eth_dev == NULL)
+ rte_panic("cannot alloc rte_eth_dev\n");
+
+ data = eth_dev->data;
+
+ pci_dev = rte_zmalloc(NULL, sizeof(*pci_dev), 0);
+ if (!pci_dev)
+ rte_panic("cannot alloc pci_dev\n");
+ hw = rte_zmalloc(NULL, sizeof(*hw), 0);
+ if (!hw)
+ rte_panic("malloc virtio_hw failed\n");
+
+ data->dev_private = hw;
+ pci_dev->numa_node = SOCKET_ID_ANY;
+ /* TODO: should remove pci_dev after Bernard Iremonger's patch applied */
+ eth_dev->pci_dev = pci_dev;
+ /* will be used in virtio_dev_info_get() */
+ eth_dev->driver = &rte_virtio_pmd;
+ /* TAILQ_INIT(&(eth_dev->link_intr_cbs)); */
+ return eth_dev;
+}
+
+/*
+ * Dev initialization routine.
+ * Invoked once for each virtio vdev at EAL init time,
+ * See rte_eal_dev_init().
+ * Returns 0 on success.
+ */
+static int
+rte_cvio_pmd_devinit(const char *name, const char *params)
+{
+ struct rte_kvargs *kvlist = NULL;
+ struct rte_eth_dev *eth_dev = NULL;
+ uint64_t nb_rx = 1, nb_tx = 1, nb_cq = 0, queue_num = 256;
+ char sock_path[256];
+
+ if (params == NULL || params[0] == '\0') {
+ rte_panic("param is null\n");
+ }
+
+ kvlist = rte_kvargs_parse(params, valid_args);
+ if (!kvlist)
+ rte_panic("error when parsing param\n");
+
+ if (rte_kvargs_count(kvlist, ETH_CVIO_ARG_SK_PATH) == 1) {
+ rte_kvargs_process(kvlist, ETH_CVIO_ARG_SK_PATH,
+ &get_string_arg, sock_path);
+ } else {
+ rte_panic("no arg: %s\n", ETH_CVIO_ARG_SK_PATH);
+ }
+
+ if (rte_kvargs_count(kvlist, ETH_CVIO_ARG_QUEUE_SIZE) == 1) {
+ rte_kvargs_process(kvlist, ETH_CVIO_ARG_QUEUE_SIZE,
+ &get_integer_arg, &queue_num);
+ }
+
+ if (rte_kvargs_count(kvlist, ETH_CVIO_ARG_RX_NUM) == 1) {
+ rte_kvargs_process(kvlist, ETH_CVIO_ARG_RX_NUM,
+ &get_integer_arg, &nb_rx);
+ }
+
+ if (rte_kvargs_count(kvlist, ETH_CVIO_ARG_TX_NUM) == 1) {
+ rte_kvargs_process(kvlist, ETH_CVIO_ARG_TX_NUM,
+ &get_integer_arg, &nb_tx);
+ }
+
+ if (rte_kvargs_count(kvlist, ETH_CVIO_ARG_CQ_NUM) == 1) {
+ rte_kvargs_process(kvlist, ETH_CVIO_ARG_CQ_NUM,
+ &get_integer_arg, &nb_cq);
+ }
+
+ eth_dev = cvio_eth_dev_alloc(name);
+
+ virtio_vdev_init(eth_dev->data, sock_path,
+ nb_rx, nb_tx, nb_cq, queue_num);
+
+ /* originally, this will be called in rte_eal_pci_probe() */
+ eth_virtio_dev_init(eth_dev);
+
+ return 0;
+}
+
+static int
+rte_cvio_pmd_devuninit(const char *name)
+{
+ /* TODO: if it's last one, memory init, free memory */
+ rte_panic("%s", name);
+ return 0;
+}
+
+static struct rte_driver rte_cvio_driver = {
+ .name = "eth_cvio",
+ .type = PMD_VDEV,
+ .init = rte_cvio_pmd_devinit,
+ .uninit = rte_cvio_pmd_devuninit,
+};
+
+PMD_REGISTER_DRIVER(rte_cvio_driver);
+
+#endif
diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index ae2d47d..25613ac 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -56,6 +56,17 @@
#define VIRTIO_MAX_RX_PKTLEN 9728

/* Features desired/implemented by this driver. */
+#ifdef RTE_VIRTIO_VDEV
+/* use random mac addr for now */
+/* control queue not available for now */
+#define VIRTIO_PMD_GUEST_FEATURES \
+ (1u << VIRTIO_NET_F_STATUS | \
+ 1u << VIRTIO_NET_F_MQ | \
+ 1u << VIRTIO_NET_F_CTRL_MAC_ADDR | \
+ 1u << VIRTIO_NET_F_CTRL_RX | \
+ 1u << VIRTIO_NET_F_CTRL_VLAN | \
+ 1u << VIRTIO_NET_F_MRG_RXBUF)
+#else
#define VIRTIO_PMD_GUEST_FEATURES \
(1u << VIRTIO_NET_F_MAC | \
1u << VIRTIO_NET_F_STATUS | \
@@ -65,6 +76,7 @@
1u << VIRTIO_NET_F_CTRL_RX | \
1u << VIRTIO_NET_F_CTRL_VLAN | \
1u << VIRTIO_NET_F_MRG_RXBUF)
+#endif

/*
* CQ function prototype
@@ -122,5 +134,9 @@ uint16_t virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
#define VTNET_LRO_FEATURES (VIRTIO_NET_F_GUEST_TSO4 | \
VIRTIO_NET_F_GUEST_TSO6 | VIRTIO_NET_F_GUEST_ECN)

+#ifdef RTE_VIRTIO_VDEV
+int virtio_vdev_init(struct rte_eth_dev_data *data, const char *path,
+ int nb_rx, int nb_tx, int nb_cq, int queue_num);
+#endif

#endif /* _VIRTIO_ETHDEV_H_ */
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 689c321..7eb4187 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -68,8 +68,13 @@ struct rte_mbuf;

#define VIRTQUEUE_MAX_NAME_SZ 32

+#ifdef RTE_VIRTIO_VDEV
+#define RTE_MBUF_DATA_DMA_ADDR(mb) \
+ ((uint64_t)(mb)->buf_addr + (mb)->data_off)
+#else
#define RTE_MBUF_DATA_DMA_ADDR(mb) \
(uint64_t) ((mb)->buf_physaddr + (mb)->data_off)
+#endif /* RTE_VIRTIO_VDEV */

#define VTNET_SQ_RQ_QUEUE_IDX 0
#define VTNET_SQ_TQ_QUEUE_IDX 1
@@ -169,7 +174,8 @@ struct virtqueue {

void *vq_ring_virt_mem; /**< linear address of vring*/
unsigned int vq_ring_size;
- phys_addr_t vq_ring_mem; /**< physical address of vring */
+ phys_addr_t vq_ring_mem; /**< physical address of vring for non-vdev,
+ virtual addr of vring for vdev*/

struct vring vq_ring; /**< vring keeping desc, used and avail */
uint16_t vq_free_cnt; /**< num of desc available */
@@ -190,6 +196,7 @@ struct virtqueue {
uint16_t vq_avail_idx;
uint64_t mbuf_initializer; /**< value to init mbufs. */
phys_addr_t virtio_net_hdr_mem; /**< hdr for each xmit packet */
+ void *virtio_net_hdr_vaddr; /**< linear address of vring*/

struct rte_mbuf **sw_ring; /**< RX software ring. */
/* dummy mbuf, for wraparound when processing RX ring. */
--
2.1.4
Jianfeng Tan
2015-11-05 18:31:14 UTC
Unify desc->addr assignment using RTE_MBUF_DATA_DMA_ADDR. virtio
for vm uses physical address, while virtio for container uses
virtual address.

Signed-off-by: Huawei Xie <***@intel.com>
Signed-off-by: Jianfeng Tan <***@intel.com>
---
drivers/net/virtio/virtio_rxtx.c | 9 ++++-----
drivers/net/virtio/virtio_rxtx_simple.c | 9 ++++-----
2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 5770fa2..1cfb2b9 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -191,8 +191,7 @@ virtqueue_enqueue_recv_refill(struct virtqueue *vq, struct rte_mbuf *cookie)

start_dp = vq->vq_ring.desc;
start_dp[idx].addr =
- (uint64_t)(cookie->buf_physaddr + RTE_PKTMBUF_HEADROOM
- - hw->vtnet_hdr_size);
+ RTE_MBUF_DATA_DMA_ADDR(cookie) - hw->vtnet_hdr_size;
start_dp[idx].len =
cookie->buf_len - RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size;
start_dp[idx].flags = VRING_DESC_F_WRITE;
@@ -343,7 +342,7 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
VIRTIO_WRITE_REG_2(vq->hw, VIRTIO_PCI_QUEUE_SEL,
vq->vq_queue_index);
VIRTIO_WRITE_REG_4(vq->hw, VIRTIO_PCI_QUEUE_PFN,
- vq->mz->phys_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT);
+ vq->vq_ring_mem >> VIRTIO_PCI_QUEUE_ADDR_SHIFT);
} else if (queue_type == VTNET_TQ) {
if (use_simple_rxtx) {
int mid_idx = vq->vq_nentries >> 1;
@@ -366,12 +365,12 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
VIRTIO_WRITE_REG_2(vq->hw, VIRTIO_PCI_QUEUE_SEL,
vq->vq_queue_index);
VIRTIO_WRITE_REG_4(vq->hw, VIRTIO_PCI_QUEUE_PFN,
- vq->mz->phys_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT);
+ vq->vq_ring_mem >> VIRTIO_PCI_QUEUE_ADDR_SHIFT);
} else {
VIRTIO_WRITE_REG_2(vq->hw, VIRTIO_PCI_QUEUE_SEL,
vq->vq_queue_index);
VIRTIO_WRITE_REG_4(vq->hw, VIRTIO_PCI_QUEUE_PFN,
- vq->mz->phys_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT);
+ vq->vq_ring_mem >> VIRTIO_PCI_QUEUE_ADDR_SHIFT);
}
}

diff --git a/drivers/net/virtio/virtio_rxtx_simple.c b/drivers/net/virtio/virtio_rxtx_simple.c
index ff3c11a..d1bb4c4 100644
--- a/drivers/net/virtio/virtio_rxtx_simple.c
+++ b/drivers/net/virtio/virtio_rxtx_simple.c
@@ -80,8 +80,8 @@ virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
vq->sw_ring[desc_idx] = cookie;

start_dp = vq->vq_ring.desc;
- start_dp[desc_idx].addr = (uint64_t)((uintptr_t)cookie->buf_physaddr +
- RTE_PKTMBUF_HEADROOM - sizeof(struct virtio_net_hdr));
+ start_dp[desc_idx].addr = RTE_MBUF_DATA_DMA_ADDR(cookie)
+ - sizeof(struct virtio_net_hdr);
start_dp[desc_idx].len = cookie->buf_len -
RTE_PKTMBUF_HEADROOM + sizeof(struct virtio_net_hdr);

@@ -118,9 +118,8 @@ virtio_rxq_rearm_vec(struct virtqueue *rxvq)
p = (uintptr_t)&sw_ring[i]->rearm_data;
*(uint64_t *)p = rxvq->mbuf_initializer;

- start_dp[i].addr =
- (uint64_t)((uintptr_t)sw_ring[i]->buf_physaddr +
- RTE_PKTMBUF_HEADROOM - sizeof(struct virtio_net_hdr));
+ start_dp[i].addr = RTE_MBUF_DATA_DMA_ADDR(sw_ring[i])
+ - sizeof(struct virtio_net_hdr);
start_dp[i].len = sw_ring[i]->buf_len -
RTE_PKTMBUF_HEADROOM + sizeof(struct virtio_net_hdr);
}
--
2.1.4
Jianfeng Tan
2015-11-05 18:31:15 UTC
When using virtio for containers, we should specify --no-huge so
that, during memory initialization, shm_open() is used to allocate
memory from the tmpfs filesystem /dev/shm/.

Signed-off-by: Huawei Xie <***@intel.com>
Signed-off-by: Jianfeng Tan <***@intel.com>
---
lib/librte_eal/common/include/rte_memory.h | 5 +++
lib/librte_eal/linuxapp/eal/eal_memory.c | 58 ++++++++++++++++++++++++++++--
lib/librte_mempool/rte_mempool.c | 16 ++++-----
3 files changed, 69 insertions(+), 10 deletions(-)

diff --git a/lib/librte_eal/common/include/rte_memory.h b/lib/librte_eal/common/include/rte_memory.h
index 1bed415..9c1effc 100644
--- a/lib/librte_eal/common/include/rte_memory.h
+++ b/lib/librte_eal/common/include/rte_memory.h
@@ -100,6 +100,7 @@ struct rte_memseg {
int32_t socket_id; /**< NUMA socket ID. */
uint32_t nchannel; /**< Number of channels. */
uint32_t nrank; /**< Number of ranks. */
+ int fd; /**< fd used for share this memory */
#ifdef RTE_LIBRTE_XEN_DOM0
/**< store segment MFNs */
uint64_t mfn[DOM0_NUM_MEMBLOCK];
@@ -128,6 +129,10 @@ int rte_mem_lock_page(const void *virt);
*/
phys_addr_t rte_mem_virt2phy(const void *virt);

+
+int
+rte_memseg_info_get(int index, int *pfd, uint64_t *psize, void **paddr);
+
/**
* Get the layout of the available physical memory.
*
diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index ac2745e..9abbfc6 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -80,6 +80,9 @@
#include <errno.h>
#include <sys/ioctl.h>
#include <sys/time.h>
+#include <mntent.h>
+#include <sys/mman.h>
+#include <sys/file.h>

#include <rte_log.h>
#include <rte_memory.h>
@@ -143,6 +146,18 @@ rte_mem_lock_page(const void *virt)
return mlock((void*)aligned, page_size);
}

+int
+rte_memseg_info_get(int index, int *pfd, uint64_t *psize, void **paddr)
+{
+ struct rte_mem_config *mcfg;
+ mcfg = rte_eal_get_configuration()->mem_config;
+
+ *pfd = mcfg->memseg[index].fd;
+ *psize = (uint64_t)mcfg->memseg[index].len;
+ *paddr = (void *)(uint64_t)mcfg->memseg[index].addr;
+ return 0;
+}
+
/*
* Get physical address of any mapped virtual address in the current process.
*/
@@ -1044,6 +1059,42 @@ calc_num_pages_per_socket(uint64_t * memory,
return total_num_pages;
}

+static void *
+rte_eal_shm_create(int *pfd)
+{
+ int ret, fd;
+ char filepath[256];
+ void *vaddr;
+ uint64_t size = internal_config.memory;
+
+ sprintf(filepath, "/%s_cvio", internal_config.hugefile_prefix);
+
+ fd = shm_open(filepath, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
+ if (fd < 0) {
+ rte_panic("shm_open %s failed: %s\n", filepath, strerror(errno));
+ }
+ ret = flock(fd, LOCK_EX);
+ if (ret < 0) {
+ close(fd);
+ rte_panic("flock %s failed: %s\n", filepath, strerror(errno));
+ }
+
+ ret = ftruncate(fd, size);
+ if (ret < 0) {
+ rte_panic("ftruncate failed: %s\n", strerror(errno));
+ }
+ /* flag: MAP_HUGETLB */
+ vaddr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ if (vaddr == MAP_FAILED) {
+ rte_panic("mmap failed: %s\n", strerror(errno));
+ }
+ memset(vaddr, 0, size);
+ *pfd = fd;
+
+ return vaddr;
+}
+
+
/*
* Prepare physical memory mapping: fill configuration structure with
* these infos, return 0 on success.
@@ -1072,7 +1123,9 @@ rte_eal_hugepage_init(void)
int new_pages_count[MAX_HUGEPAGE_SIZES];
#endif

+#ifndef RTE_VIRTIO_VDEV
test_proc_pagemap_readable();
+#endif

memset(used_hp, 0, sizeof(used_hp));

@@ -1081,8 +1134,8 @@ rte_eal_hugepage_init(void)

/* hugetlbfs can be disabled */
if (internal_config.no_hugetlbfs) {
- addr = mmap(NULL, internal_config.memory, PROT_READ | PROT_WRITE,
- MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+ int fd;
+ addr = rte_eal_shm_create(&fd);
if (addr == MAP_FAILED) {
RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n", __func__,
strerror(errno));
@@ -1093,6 +1146,7 @@ rte_eal_hugepage_init(void)
mcfg->memseg[0].hugepage_sz = RTE_PGSIZE_4K;
mcfg->memseg[0].len = internal_config.memory;
mcfg->memseg[0].socket_id = 0;
+ mcfg->memseg[0].fd = fd;
return 0;
}

diff --git a/lib/librte_mempool/rte_mempool.c b/lib/librte_mempool/rte_mempool.c
index e57cbbd..8f8852b 100644
--- a/lib/librte_mempool/rte_mempool.c
+++ b/lib/librte_mempool/rte_mempool.c
@@ -453,13 +453,6 @@ rte_mempool_xmem_create(const char *name, unsigned n, unsigned elt_size,
rte_errno = EINVAL;
return NULL;
}
-
- /* check that we have both VA and PA */
- if (vaddr != NULL && paddr == NULL) {
- rte_errno = EINVAL;
- return NULL;
- }
-
/* Check that pg_num and pg_shift parameters are valid. */
if (pg_num < RTE_DIM(mp->elt_pa) || pg_shift > MEMPOOL_PG_SHIFT_MAX) {
rte_errno = EINVAL;
@@ -596,8 +589,15 @@ rte_mempool_xmem_create(const char *name, unsigned n, unsigned elt_size,

/* mempool elements in a separate chunk of memory. */
} else {
+ /* when VA is specified, PA should be specified? */
+ if (rte_eal_has_hugepages()) {
+ if (paddr == NULL) {
+ rte_errno = EINVAL;
+ return NULL;
+ }
+ memcpy(mp->elt_pa, paddr, sizeof (mp->elt_pa[0]) * pg_num);
+ }
mp->elt_va_start = (uintptr_t)vaddr;
- memcpy(mp->elt_pa, paddr, sizeof (mp->elt_pa[0]) * pg_num);
}

mp->elt_va_end = mp->elt_va_start;
--
2.1.4
Ananyev, Konstantin
2015-11-06 16:21:55 UTC
Hi,
-----Original Message-----
Sent: Thursday, November 05, 2015 6:31 PM
Subject: [dpdk-dev] [RFC 4/5] virtio/container: adjust memory initialization process
When using virtio for container, we should specify --no-huge so
that in memory initialization, shm_open() is used to alloc memory
from tmpfs filesystem /dev/shm/.
---
lib/librte_eal/common/include/rte_memory.h | 5 +++
lib/librte_eal/linuxapp/eal/eal_memory.c | 58 ++++++++++++++++++++++++++++--
lib/librte_mempool/rte_mempool.c | 16 ++++-----
3 files changed, 69 insertions(+), 10 deletions(-)
diff --git a/lib/librte_eal/common/include/rte_memory.h b/lib/librte_eal/common/include/rte_memory.h
index 1bed415..9c1effc 100644
--- a/lib/librte_eal/common/include/rte_memory.h
+++ b/lib/librte_eal/common/include/rte_memory.h
@@ -100,6 +100,7 @@ struct rte_memseg {
int32_t socket_id; /**< NUMA socket ID. */
uint32_t nchannel; /**< Number of channels. */
uint32_t nrank; /**< Number of ranks. */
+ int fd; /**< fd used for share this memory */
#ifdef RTE_LIBRTE_XEN_DOM0
/**< store segment MFNs */
uint64_t mfn[DOM0_NUM_MEMBLOCK];
@@ -128,6 +129,10 @@ int rte_mem_lock_page(const void *virt);
*/
phys_addr_t rte_mem_virt2phy(const void *virt);
+
+int
+rte_memseg_info_get(int index, int *pfd, uint64_t *psize, void **paddr);
+
/**
* Get the layout of the available physical memory.
*
diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index ac2745e..9abbfc6 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -80,6 +80,9 @@
#include <errno.h>
#include <sys/ioctl.h>
#include <sys/time.h>
+#include <mntent.h>
+#include <sys/mman.h>
+#include <sys/file.h>
#include <rte_log.h>
#include <rte_memory.h>
@@ -143,6 +146,18 @@ rte_mem_lock_page(const void *virt)
return mlock((void*)aligned, page_size);
}
+int
+rte_memseg_info_get(int index, int *pfd, uint64_t *psize, void **paddr)
+{
+ struct rte_mem_config *mcfg;
+ mcfg = rte_eal_get_configuration()->mem_config;
+
+ *pfd = mcfg->memseg[index].fd;
+ *psize = (uint64_t)mcfg->memseg[index].len;
+ *paddr = (void *)(uint64_t)mcfg->memseg[index].addr;
+ return 0;
+}
Wonder who will use that function?
Can't see any references to that function in that patch or next.
+
/*
* Get physical address of any mapped virtual address in the current process.
*/
@@ -1044,6 +1059,42 @@ calc_num_pages_per_socket(uint64_t * memory,
return total_num_pages;
}
+static void *
+rte_eal_shm_create(int *pfd)
+{
+ int ret, fd;
+ char filepath[256];
+ void *vaddr;
+ uint64_t size = internal_config.memory;
+
+ sprintf(filepath, "/%s_cvio", internal_config.hugefile_prefix);
+
+ fd = shm_open(filepath, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
+ if (fd < 0) {
+ rte_panic("shm_open %s failed: %s\n", filepath, strerror(errno));
+ }
+ ret = flock(fd, LOCK_EX);
+ if (ret < 0) {
+ close(fd);
+ rte_panic("flock %s failed: %s\n", filepath, strerror(errno));
+ }
+
+ ret = ftruncate(fd, size);
+ if (ret < 0) {
+ rte_panic("ftruncate failed: %s\n", strerror(errno));
+ }
+ /* flag: MAP_HUGETLB */
Could you explain what that comment means here?
Do you plan to use MAP_HUGETLB in the call below, or ...?
+ vaddr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ if (vaddr == MAP_FAILED) {
+ rte_panic("mmap failed: %s\n", strerror(errno));
+ }
+ memset(vaddr, 0, size);
+ *pfd = fd;
+
+ return vaddr;
+}
+
+
/*
* Prepare physical memory mapping: fill configuration structure with
* these infos, return 0 on success.
@@ -1072,7 +1123,9 @@ rte_eal_hugepage_init(void)
int new_pages_count[MAX_HUGEPAGE_SIZES];
#endif
+#ifndef RTE_VIRTIO_VDEV
test_proc_pagemap_readable();
+#endif
memset(used_hp, 0, sizeof(used_hp));
@@ -1081,8 +1134,8 @@ rte_eal_hugepage_init(void)
/* hugetlbfs can be disabled */
if (internal_config.no_hugetlbfs) {
- addr = mmap(NULL, internal_config.memory, PROT_READ | PROT_WRITE,
- MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+ int fd;
+ addr = rte_eal_shm_create(&fd);
Why do you remove the ability to mmap(/dev/zero) here?
Probably not everyone plans to use --no-hugepages only inside containers.
if (addr == MAP_FAILED) {
RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n", __func__,
strerror(errno));
@@ -1093,6 +1146,7 @@ rte_eal_hugepage_init(void)
mcfg->memseg[0].hugepage_sz = RTE_PGSIZE_4K;
mcfg->memseg[0].len = internal_config.memory;
mcfg->memseg[0].socket_id = 0;
+ mcfg->memseg[0].fd = fd;
return 0;
}
diff --git a/lib/librte_mempool/rte_mempool.c b/lib/librte_mempool/rte_mempool.c
index e57cbbd..8f8852b 100644
--- a/lib/librte_mempool/rte_mempool.c
+++ b/lib/librte_mempool/rte_mempool.c
@@ -453,13 +453,6 @@ rte_mempool_xmem_create(const char *name, unsigned n, unsigned elt_size,
rte_errno = EINVAL;
return NULL;
}
-
- /* check that we have both VA and PA */
- if (vaddr != NULL && paddr == NULL) {
- rte_errno = EINVAL;
- return NULL;
- }
-
/* Check that pg_num and pg_shift parameters are valid. */
if (pg_num < RTE_DIM(mp->elt_pa) || pg_shift > MEMPOOL_PG_SHIFT_MAX) {
rte_errno = EINVAL;
@@ -596,8 +589,15 @@ rte_mempool_xmem_create(const char *name, unsigned n, unsigned elt_size,
/* mempool elements in a separate chunk of memory. */
} else {
+ /* when VA is specified, PA should be specified? */
+ if (rte_eal_has_hugepages()) {
+ if (paddr == NULL) {
+ rte_errno = EINVAL;
+ return NULL;
+ }
+ memcpy(mp->elt_pa, paddr, sizeof (mp->elt_pa[0]) * pg_num);
+ }
mp->elt_va_start = (uintptr_t)vaddr;
- memcpy(mp->elt_pa, paddr, sizeof (mp->elt_pa[0]) * pg_num);
Could you explain the reason for that change?
Specifically, why is a mempool over external memory now only allowed for the hugepages config?
Konstantin
}
mp->elt_va_end = mp->elt_va_start;
--
2.1.4
Tan, Jianfeng
2015-11-08 11:18:12 UTC
-----Original Message-----
From: Ananyev, Konstantin
Sent: Saturday, November 7, 2015 12:22 AM
Subject: RE: [dpdk-dev] [RFC 4/5] virtio/container: adjust memory
initialization process
Hi,
-----Original Message-----
Sent: Thursday, November 05, 2015 6:31 PM
Subject: [dpdk-dev] [RFC 4/5] virtio/container: adjust memory initialization process
When using virtio for container, we should specify --no-huge so that
in memory initialization, shm_open() is used to alloc memory from
tmpfs filesystem /dev/shm/.
---
......
+int
+rte_memseg_info_get(int index, int *pfd, uint64_t *psize, void
+**paddr) {
+ struct rte_mem_config *mcfg;
+ mcfg = rte_eal_get_configuration()->mem_config;
+
+ *pfd = mcfg->memseg[index].fd;
+ *psize = (uint64_t)mcfg->memseg[index].len;
+ *paddr = (void *)(uint64_t)mcfg->memseg[index].addr;
+ return 0;
+}
Wonder who will use that function?
Can't see any references to that function in that patch or next.
This function is used in 1/5, when the virtio frontend needs to send VHOST_USER_SET_MEM_TABLE to the backend.
+
/*
* Get physical address of any mapped virtual address in the current
process.
*/
@@ -1044,6 +1059,42 @@ calc_num_pages_per_socket(uint64_t *
memory,
return total_num_pages;
}
+static void *
+rte_eal_shm_create(int *pfd)
+{
+ int ret, fd;
+ char filepath[256];
+ void *vaddr;
+ uint64_t size = internal_config.memory;
+
+ sprintf(filepath, "/%s_cvio", internal_config.hugefile_prefix);
+
+ fd = shm_open(filepath, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
+ if (fd < 0) {
+ rte_panic("shm_open %s failed: %s\n", filepath,
strerror(errno));
+ }
+ ret = flock(fd, LOCK_EX);
+ if (ret < 0) {
+ close(fd);
+ rte_panic("flock %s failed: %s\n", filepath, strerror(errno));
+ }
+
+ ret = ftruncate(fd, size);
+ if (ret < 0) {
+ rte_panic("ftruncate failed: %s\n", strerror(errno));
+ }
+ /* flag: MAP_HUGETLB */
Could you explain what that comment means here?
Do you plan to use MAP_HUGETLB in the call below, or ...?
Yes, it's a todo item. Shm_open() just uses a tmpfs mounted at /dev/shm. So I wonder maybe we can use this flag to make
sure os allocates hugepages here if user would like to use hugepages.
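For reference, MAP_HUGETLB is straightforward for anonymous mappings (a sketch only; whether the kernel accepts the flag together with a tmpfs-backed fd from shm_open() is exactly the open question here):

#include <sys/mman.h>

/* sketch: anonymous hugepage mapping via MAP_HUGETLB */
static void *alloc_anon_huge(size_t size)
{
	return mmap(NULL, size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
}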
......
@@ -1081,8 +1134,8 @@ rte_eal_hugepage_init(void)
/* hugetlbfs can be disabled */
if (internal_config.no_hugetlbfs) {
- addr = mmap(NULL, internal_config.memory, PROT_READ | PROT_WRITE,
- MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+ int fd;
+ addr = rte_eal_shm_create(&fd);
Why do you remove the ability to mmap(/dev/zero) here?
Probably not everyone plans to use --no-hugepages only inside containers.
From my understanding, mmap here is just to allocate some memory, which is initialized to all zeros. I cannot understand
the relationship with /dev/zero. rte_eal_shm_create() does what the original call did, plus it generates an fd that refers to this chunk of
memory. This fd is indispensable in the vhost protocol, when VHOST_USER_SET_MEM_TABLE is sent using sendmsg().
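For the curious, the fd passing itself is plain SCM_RIGHTS ancillary data over the unix socket; a generic sketch (standard POSIX usage, not the exact code in this series):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static ssize_t send_with_fd(int sock, void *payload, size_t len, int fd)
{
	struct iovec iov = { .iov_base = payload, .iov_len = len };
	char control[CMSG_SPACE(sizeof(int))];
	struct msghdr msgh;
	struct cmsghdr *cmsg;

	memset(&msgh, 0, sizeof(msgh));
	memset(control, 0, sizeof(control));
	msgh.msg_iov = &iov;
	msgh.msg_iovlen = 1;
	msgh.msg_control = control;
	msgh.msg_controllen = sizeof(control);

	cmsg = CMSG_FIRSTHDR(&msgh);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;  /* kernel duplicates the fd for the peer */
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msgh, 0);
}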
if (addr == MAP_FAILED) {
RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n",
__func__,
strerror(errno));
@@ -1093,6 +1146,7 @@ rte_eal_hugepage_init(void)
mcfg->memseg[0].hugepage_sz = RTE_PGSIZE_4K;
mcfg->memseg[0].len = internal_config.memory;
mcfg->memseg[0].socket_id = 0;
+ mcfg->memseg[0].fd = fd;
return 0;
}
diff --git a/lib/librte_mempool/rte_mempool.c
b/lib/librte_mempool/rte_mempool.c
index e57cbbd..8f8852b 100644
--- a/lib/librte_mempool/rte_mempool.c
+++ b/lib/librte_mempool/rte_mempool.c
@@ -453,13 +453,6 @@ rte_mempool_xmem_create(const char *name, unsigned n, unsigned elt_size,
rte_errno = EINVAL;
return NULL;
}
-
- /* check that we have both VA and PA */
- if (vaddr != NULL && paddr == NULL) {
- rte_errno = EINVAL;
- return NULL;
- }
-
/* Check that pg_num and pg_shift parameters are valid. */
if (pg_num < RTE_DIM(mp->elt_pa) || pg_shift > MEMPOOL_PG_SHIFT_MAX) {
rte_errno = EINVAL;
@@ -596,8 +589,15 @@ rte_mempool_xmem_create(const char *name, unsigned n, unsigned elt_size,
/* mempool elements in a separate chunk of memory. */
} else {
+ /* when VA is specified, PA should be specified? */
+ if (rte_eal_has_hugepages()) {
+ if (paddr == NULL) {
+ rte_errno = EINVAL;
+ return NULL;
+ }
+ memcpy(mp->elt_pa, paddr, sizeof (mp->elt_pa[0]) * pg_num);
+ }
mp->elt_va_start = (uintptr_t)vaddr;
- memcpy(mp->elt_pa, paddr, sizeof (mp->elt_pa[0]) * pg_num);
Could you explain the reason for that change?
Specifically, why is a mempool over external memory now only allowed for the hugepages config?
Konstantin
Oops, you're right! This change was previously for creating an mbuf mempool at a given vaddr without
giving any paddr[]. Now we don't need to care about either vaddr or paddr[], so I should have reverted
the change in this file.
}
mp->elt_va_end = mp->elt_va_start;
--
2.1.4
Ananyev, Konstantin
2015-11-09 13:32:29 UTC
Permalink
Post by Tan, Jianfeng
-----Original Message-----
From: Ananyev, Konstantin
Sent: Saturday, November 7, 2015 12:22 AM
Subject: RE: [dpdk-dev] [RFC 4/5] virtio/container: adjust memory initialization process
Hi,
-----Original Message-----
Sent: Thursday, November 05, 2015 6:31 PM
Subject: [dpdk-dev] [RFC 4/5] virtio/container: adjust memory initialization process
When using virtio for container, we should specify --no-huge so that
in memory initialization, shm_open() is used to alloc memory from
tmpfs filesystem /dev/shm/.
---
......
+int
+rte_memseg_info_get(int index, int *pfd, uint64_t *psize, void **paddr)
+{
+ struct rte_mem_config *mcfg;
+ mcfg = rte_eal_get_configuration()->mem_config;
+
+ *pfd = mcfg->memseg[index].fd;
+ *psize = (uint64_t)mcfg->memseg[index].len;
+ *paddr = (void *)(uint64_t)mcfg->memseg[index].addr;
+ return 0;
+}
Wonder who will use that function?
Can't see any references to that function in that patch or next.
This function is used in 1/5, when virtio front end needs to send VHOST_USER_SET_MEM_TABLE to back end.
Ok, but then this function should be defined in the patch *before* it is used, not after.
Another thing: it's probably better to create a struct for all the memseg parameters you want to retrieve,
and pass it to the function, instead of several pointers.
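Something like this, perhaps (a sketch of the suggested interface; the struct name is illustrative):

#include <stdint.h>

struct rte_memseg_info {
	int fd;          /* fd of the backing file */
	uint64_t size;   /* length of the segment */
	void *addr;      /* virtual address of the segment */
};

int rte_memseg_info_get(int index, struct rte_memseg_info *info);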
Post by Tan, Jianfeng
+
/*
* Get physical address of any mapped virtual address in the current
process.
*/
@@ -1044,6 +1059,42 @@ calc_num_pages_per_socket(uint64_t * memory,
return total_num_pages;
}
+static void *
+rte_eal_shm_create(int *pfd)
+{
+ int ret, fd;
+ char filepath[256];
+ void *vaddr;
+ uint64_t size = internal_config.memory;
+
+ sprintf(filepath, "/%s_cvio", internal_config.hugefile_prefix);
+
+ fd = shm_open(filepath, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
+ if (fd < 0) {
+ rte_panic("shm_open %s failed: %s\n", filepath,
strerror(errno));
+ }
+ ret = flock(fd, LOCK_EX);
+ if (ret < 0) {
+ close(fd);
+ rte_panic("flock %s failed: %s\n", filepath, strerror(errno));
+ }
+
+ ret = ftruncate(fd, size);
+ if (ret < 0) {
+ rte_panic("ftruncate failed: %s\n", strerror(errno));
+ }
+ /* flag: MAP_HUGETLB */
Any explanation of what that comment means here?
Do you plan to use MAP_HUGETLB in the call below, or ...?
Yes, it's a todo item. shm_open() just uses a tmpfs mounted at /dev/shm. So I wonder whether we can use this flag to make
sure the OS allocates hugepages here, if the user would like to use hugepages.
......
@@ -1081,8 +1134,8 @@ rte_eal_hugepage_init(void)
/* hugetlbfs can be disabled */
if (internal_config.no_hugetlbfs) {
- addr = mmap(NULL, internal_config.memory, PROT_READ | PROT_WRITE,
- MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+ int fd;
+ addr = rte_eal_shm_create(&fd);
Why do you remove the ability to mmap(/dev/zero) here?
Probably not everyone plans to use --no-hugepages only inside containers.
From my understanding, mmap here is just to allocate some memory, which is initialized to all zeros. I cannot understand
the relationship with /dev/zero.
I used it here as a synonym for mmap(, ..., MAP_ANONYMOUS,...).

Post by Tan, Jianfeng
rte_eal_shm_create() does what the original call did, plus it generates an fd that refers to this chunk of
memory. This fd is indispensable in the vhost protocol, when VHOST_USER_SET_MEM_TABLE is sent using sendmsg().
My question was:
Right now for --no-hugepages it allocates a chunk of memory that is not backed up by any file and is private to the process:

addr = mmap(NULL, internal_config.memory, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);

You changed it to shared memory region allocation:

fd = shm_open(filepath, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

I understand that you need it for your containers stuff - but I suppose you have to add
the new functionality without breaking the existing one.
There could be other users of --no-hugepages and they probably want the existing behaviour.
Konstantin
Post by Tan, Jianfeng
if (addr == MAP_FAILED) {
RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n",
__func__,
strerror(errno));
@@ -1093,6 +1146,7 @@ rte_eal_hugepage_init(void)
mcfg->memseg[0].hugepage_sz = RTE_PGSIZE_4K;
mcfg->memseg[0].len = internal_config.memory;
mcfg->memseg[0].socket_id = 0;
+ mcfg->memseg[0].fd = fd;
return 0;
}
diff --git a/lib/librte_mempool/rte_mempool.c
b/lib/librte_mempool/rte_mempool.c
index e57cbbd..8f8852b 100644
--- a/lib/librte_mempool/rte_mempool.c
+++ b/lib/librte_mempool/rte_mempool.c
@@ -453,13 +453,6 @@ rte_mempool_xmem_create(const char *name, unsigned n, unsigned elt_size,
rte_errno = EINVAL;
return NULL;
}
-
- /* check that we have both VA and PA */
- if (vaddr != NULL && paddr == NULL) {
- rte_errno = EINVAL;
- return NULL;
- }
-
/* Check that pg_num and pg_shift parameters are valid. */
if (pg_num < RTE_DIM(mp->elt_pa) || pg_shift > MEMPOOL_PG_SHIFT_MAX) {
rte_errno = EINVAL;
@@ -596,8 +589,15 @@ rte_mempool_xmem_create(const char *name, unsigned n, unsigned elt_size,
/* mempool elements in a separate chunk of memory. */
} else {
+ /* when VA is specified, PA should be specified? */
+ if (rte_eal_has_hugepages()) {
+ if (paddr == NULL) {
+ rte_errno = EINVAL;
+ return NULL;
+ }
+ memcpy(mp->elt_pa, paddr, sizeof (mp->elt_pa[0]) * pg_num);
+ }
mp->elt_va_start = (uintptr_t)vaddr;
- memcpy(mp->elt_pa, paddr, sizeof (mp->elt_pa[0]) * pg_num);
Could you explain the reason for that change?
Specifically, why is a mempool over external memory now only allowed for the hugepages config?
Konstantin
Oops, you're right! This change was previously for creating an mbuf mempool at a given vaddr without
giving any paddr[]. Now we don't need to care about either vaddr or paddr[], so I should have reverted the
change in this file.
}
mp->elt_va_end = mp->elt_va_start;
--
2.1.4
Tan, Jianfeng
2015-11-09 14:13:40 UTC
Permalink
Post by Tan, Jianfeng
......
Post by Tan, Jianfeng
Post by Ananyev, Konstantin
Post by Jianfeng Tan
+int
+rte_memseg_info_get(int index, int *pfd, uint64_t *psize, void **paddr)
+{
+ struct rte_mem_config *mcfg;
+ mcfg = rte_eal_get_configuration()->mem_config;
+
+ *pfd = mcfg->memseg[index].fd;
+ *psize = (uint64_t)mcfg->memseg[index].len;
+ *paddr = (void *)(uint64_t)mcfg->memseg[index].addr;
+ return 0;
+}
Wonder who will use that function?
Can't see any references to that function in that patch or next.
This function is used in 1/5, when the virtio front end needs to send
VHOST_USER_SET_MEM_TABLE to the back end.
Ok, but then this function should be defined in the patch *before* it is used, not after.
Another thing: it's probably better to create a struct for all the memseg parameters
you want to retrieve, and pass it to the function, instead of several pointers.
Very good suggestion! I'll fix it in the next version.
Post by Tan, Jianfeng
Post by Tan, Jianfeng
Post by Ananyev, Konstantin
Post by Jianfeng Tan
+ addr = rte_eal_shm_create(&fd);
Why do you remove ability to map(dev/zero) here?
Probably not everyone plan to use --no-hugepages only inside containers.
From my understanding, mmap here is just to allocate some memory,
which is initialized to all zeros. I cannot understand the
relationship with /dev/zero.
I used it here as a synonym for mmap(, ..., MAP_ANONYMOUS,...).
Post by Tan, Jianfeng
rte_eal_shm_create() does what the original call did, plus it generates an fd that refers to
this chunk of memory. This fd is indispensable in the vhost protocol, when
VHOST_USER_SET_MEM_TABLE is sent using sendmsg().
Right now for --no-hugepages it allocates a chunk of memory that is not
backed up by any file and is private to the process:
addr = mmap(NULL, internal_config.memory, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
You changed it to a shared memory region allocation:
fd = shm_open(filepath, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
I understand that you need it for your containers stuff - but I suppose you
have to add the new functionality without breaking the existing one. There could be
other users of --no-hugepages and they probably want the existing behaviour.
Konstantin
Thank you for the patient analysis, and I agree with you. I should not have broken
compatibility with existing applications. I'd like to redesign this in the next version.
Maybe a new command-line option is necessary here.

Jianfeng

.....
Post by Tan, Jianfeng
Post by Tan, Jianfeng
Post by Ananyev, Konstantin
Post by Jianfeng Tan
--
2.1.4
Jianfeng Tan
2015-11-05 18:31:16 UTC
Permalink
Change the vhost listening socket mode so that group users and
others can connect to the vhost listening socket.

Signed-off-by: Huawei Xie <***@intel.com>
Signed-off-by: Jianfeng Tan <***@intel.com>
---
lib/librte_vhost/vhost_user/vhost-net-user.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c b/lib/librte_vhost/vhost_user/vhost-net-user.c
index 2dc0547..7b24f7c 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -42,6 +42,7 @@
#include <sys/un.h>
#include <errno.h>
#include <pthread.h>
+#include <sys/stat.h>

#include <rte_log.h>
#include <rte_virtio_net.h>
@@ -137,6 +138,10 @@ uds_socket(const char *path)
if (ret == -1)
goto err;

+ ret = chmod(un.sun_path, 0666);
+ if (ret == 0)
+ RTE_LOG(INFO, VHOST_CONFIG, "chmod 0666, ok\n");
+
return sockfd;

err:
--
2.1.4
Yuanhan Liu
2015-11-09 03:54:34 UTC
Permalink
Post by Jianfeng Tan
Change the vhost listening socket mode so that group users and
others can connect to the vhost listening socket.
---
lib/librte_vhost/vhost_user/vhost-net-user.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c b/lib/librte_vhost/vhost_user/vhost-net-user.c
index 2dc0547..7b24f7c 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -42,6 +42,7 @@
#include <sys/un.h>
#include <errno.h>
#include <pthread.h>
+#include <sys/stat.h>
#include <rte_log.h>
#include <rte_virtio_net.h>
@@ -137,6 +138,10 @@ uds_socket(const char *path)
if (ret == -1)
goto err;
+ ret = chmod(un.sun_path, 0666);
+ if (ret == 0)
+ RTE_LOG(INFO, VHOST_CONFIG, "chmod 0666, ok\n");
That doesn't seem right to me. Doing that kind of change in a library
doesn't seem to be good practice, not to mention changing it to
"0666" blindly, which allows everybody to access it.

--yliu
Post by Jianfeng Tan
+
return sockfd;
--
2.1.4
Tan, Jianfeng
2015-11-09 05:15:23 UTC
Permalink
-----Original Message-----
Sent: Monday, November 9, 2015 11:55 AM
To: Tan, Jianfeng
Subject: Re: [dpdk-dev] [RFC 5/5] vhost/container: change mode of vhost listening socket
Change the vhost listening socket mode so that group users and others
can connect to the vhost listening socket.
---
lib/librte_vhost/vhost_user/vhost-net-user.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c
b/lib/librte_vhost/vhost_user/vhost-net-user.c
index 2dc0547..7b24f7c 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -42,6 +42,7 @@
#include <sys/un.h>
#include <errno.h>
#include <pthread.h>
+#include <sys/stat.h>
#include <rte_log.h>
#include <rte_virtio_net.h>
@@ -137,6 +138,10 @@ uds_socket(const char *path)
if (ret == -1)
goto err;
+ ret = chmod(un.sun_path, 0666);
+ if (ret == 0)
+ RTE_LOG(INFO, VHOST_CONFIG, "chmod 0666, ok\n");
That doesn't seem right to me. Doing that kind of change in a library doesn't
seem to be good practice, not to mention changing it to "0666" blindly,
which allows everybody to access it.
--yliu
Hi Yuanhan,

The original intention of this change is the following use case: use "root" to
start ovs-dpdk (or any other switch application), but use other users to
run some containers. Without this change, other users cannot connect
to the vhost listening socket.

This change is not necessary if using root to start a container. It's indeed
a question worth discussing: whether it's reasonable to allow everybody
to start a virtio device.

Thanks,
Jianfeng
+
return sockfd;
--
2.1.4
Yuanhan Liu
2015-11-09 05:40:58 UTC
Permalink
On Mon, Nov 09, 2015 at 05:15:23AM +0000, Tan, Jianfeng wrote:
...
Post by Tan, Jianfeng
Post by Jianfeng Tan
+ ret = chmod(un.sun_path, 0666);
+ if (ret == 0)
+ RTE_LOG(INFO, VHOST_CONFIG, "chmod 0666, ok\n");
That doesn't seem right to me. Doing that kind of change in a
library doesn't seem to be good practice, not to mention
changing it to "0666" blindly, which allows everybody to access it.
--yliu
Hi Yuanhan,
The original intention of this change is the following use case: use "root" to
start ovs-dpdk (or any other switch application), but use other users to
run some containers. Without this change, other users cannot connect
to the vhost listening socket.
I know your concern; do it with some user-space utils (like chmod) then,
but not in a library.

BTW, "chown", limiting it to a specific user, or "chmod g+rw", limiting
it to a specific group, is more appropriate here.

--yliu
Post by Tan, Jianfeng
This change is not necessary if using root to start a container. It's indeed
a question worth discussing: whether it's reasonable to allow everybody
to start a virtio device.
Thanks,
Jianfeng
Post by Jianfeng Tan
+
return sockfd;
--
2.1.4
Tan, Jianfeng
2015-11-09 05:46:32 UTC
Permalink
-----Original Message-----
Sent: Monday, November 9, 2015 1:41 PM
To: Tan, Jianfeng
Subject: Re: [dpdk-dev] [RFC 5/5] vhost/container: change mode of vhost listening socket
...
Post by Tan, Jianfeng
Post by Yuanhan Liu
Post by Jianfeng Tan
+ ret = chmod(un.sun_path, 0666);
+ if (ret == 0)
+ RTE_LOG(INFO, VHOST_CONFIG, "chmod 0666, ok\n");
That doesn't seem right to me. Doing that kind of change in a
library doesn't seem to be good practice, not to mention
changing it to "0666" blindly, which allows everybody to access it.
--yliu
Hi Yuanhan,
The original intention of this change is the following use case: use "root"
to start ovs-dpdk (or any other switch application), but use other
users to run some containers. Without this change, other users cannot
connect to the vhost listening socket.
I know your concern; do it with some user-space utils (like chmod) then, but
not in a library.
BTW, "chown", limiting it to a specific user, or "chmod g+rw", limiting it to a
specific group, is more appropriate here.
--yliu
Got your point. I'll consider reverting this change in the next version.

Thanks!
Jianfeng
Post by Tan, Jianfeng
This change is not necessary if using root to start a container. It's
indeed a question worth discussing: whether it's reasonable to allow
everybody to start a virtio device.
Thanks,
Jianfeng
Post by Yuanhan Liu
Post by Jianfeng Tan
+
return sockfd;
--
2.1.4
Zhuangyanying
2015-11-24 03:53:00 UTC
Permalink
-----Original Message-----
Sent: Friday, November 06, 2015 2:31 AM
Subject: [RFC 0/5] virtio support for container
This patchset only acts as a PoC to request the community for comments.
This patchset is to provide high performance networking interface
(virtio) for container-based DPDK applications. The way of starting DPDK
applications in containers with ownership of NIC devices exclusively is beyond
the scope. The basic idea here is to present a new virtual device (named
eth_cvio), which can be discovered and initialized in container-based DPDK
applications rte_eal_init().
To minimize the change, we reuse already-existing virtio frontend driver code
(driver/net/virtio/).
Compared to QEMU/VM case, virtio device framework (translates I/O port r/w
operations into unix socket/cuse protocol, which is originally provided in QEMU),
is integrated in virtio frontend driver. Aka, this new converged driver actually
plays the role of original frontend driver and the role of QEMU device
framework.
The biggest difference here lies in how to calculate relative address for backend.
The principle of virtio is that: based on one or multiple shared memory
segments, vhost maintains a reference system with the base addresses and
length of these segments so that an address from VM comes (usually GPA,
Guest Physical Address), vhost can translate it into self-recognizable address
(aka VVA, Vhost Virtual Address). To decrease the overhead of address
translation, we should maintain as few segments as better. In the context of
virtual machines, GPA is always locally continuous. So it's a good choice. In
container's case, CVA (Container Virtual Address) can be used. This means
a. when set_base_addr, CVA address is used; b. when preparing RX's
descriptors, CVA address is used; c. when transmitting packets, CVA is filled in
TX's descriptors; d. in TX and CQ's header, CVA is used.
How to share memory? In VM's case, qemu always shares all physical layout to
backend. But it's not feasible for a container, as a process, to share all virtual
memory regions to backend. So only specified virtual memory regions (type is
shared) are sent to backend. It leads to a limitation that only addresses in
these areas can be used to transmit or receive packets. For now, the shared
memory is created in /dev/shm using shm_open() in the memory initialization
process.
How to use?
a. Apply the patch of virtio for container. We need two copies of patched code
(referred as dpdk-app/ and dpdk-vhost/)
$: cd dpdk-app
$: vim config/common_linuxapp (uncomment "CONFIG_RTE_VIRTIO_VDEV=y")
$: make config RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
$: make install RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
$: make -C examples/l2fwd RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
c. To build a docker image using Dockerfile below.
$: cat ./Dockerfile
FROM ubuntu:latest
WORKDIR /usr/src/dpdk
COPY . /usr/src/dpdk
CMD ["/usr/src/dpdk/examples/l2fwd/build/l2fwd", "-c", "0xc", "-n", "4",
"--no-huge", "--no-pci",
"--vdev=eth_cvio0,queue_num=256,rx=1,tx=1,cq=0,path=/var/run/usvhost",
"--", "-p", "0x1"]
$: docker build -t dpdk-app-l2fwd .
$: cd dpdk-vhost
$: make config RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
$: make install RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
$: make -C examples/vhost RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
e. Start vhost-switch
$: ./examples/vhost/build/vhost-switch -c 3 -n 4 --socket-mem 1024,1024 -- -p 0x1 --stats 1
f. Start docker
$: docker run -i -t -v <path to vhost unix socket>:/var/run/usvhost dpdk-app-l2fwd
virtio/container: add handler for ioport rd/wr
virtio/container: add a new virtual device named eth_cvio
virtio/container: unify desc->addr assignment
virtio/container: adjust memory initialization process
vhost/container: change mode of vhost listening socket
config/common_linuxapp | 5 +
drivers/net/virtio/Makefile | 4 +
drivers/net/virtio/vhost-user.c | 433
+++++++++++++++++++++++++++
drivers/net/virtio/vhost-user.h | 137 +++++++++
drivers/net/virtio/virtio_ethdev.c | 319 +++++++++++++++-----
drivers/net/virtio/virtio_ethdev.h | 16 +
drivers/net/virtio/virtio_pci.h | 32 +-
drivers/net/virtio/virtio_rxtx.c | 9 +-
drivers/net/virtio/virtio_rxtx_simple.c | 9 +-
drivers/net/virtio/virtqueue.h | 9 +-
lib/librte_eal/common/include/rte_memory.h | 5 +
lib/librte_eal/linuxapp/eal/eal_memory.c | 58 +++-
lib/librte_mempool/rte_mempool.c | 16 +-
lib/librte_vhost/vhost_user/vhost-net-user.c | 5 +
14 files changed, 967 insertions(+), 90 deletions(-) create mode 100644
drivers/net/virtio/vhost-user.c create mode 100644
drivers/net/virtio/vhost-user.h
--
2.1.4
This patch raises a good idea: adding an extra abstracted I/O layer, which would make it simple to extend the function to a kernel-mode switch (such as OVS). That's great.
But I have one question here:
it's the issue of VHOST_USER_SET_MEM_TABLE. You allocate memory from the tmpfs filesystem with just one fd, so rte_memseg_info_get() can be
used to directly get the memory topology. However, things change in kernel space, because the mempool should be created on each container's
hugetlbfs (rather than tmpfs), which is separated from the others; finally, consider the ioctl's parameters.
My solution is as follows for your reference:
/*
reg = mem->regions;
reg->guest_phys_addr = (__u64) ((struct virtqueue *)(dev->data->rx_queues[0]))->mpool->elt_va_start;
reg->userspace_addr = reg->guest_phys_addr;
reg->memory_size = ((struct virtqueue *)(dev->data->rx_queues[0]))->mpool->elt_va_end - reg->guest_phys_addr;

reg = mem->regions + 1;
reg->guest_phys_addr = (__u64)(((struct virtqueue *)(dev->data->tx_queues[0]))->virtio_net_hdr_mem);
reg->userspace_addr = reg->guest_phys_addr;
reg->memory_size = vq_size * internals->vtnet_hdr_size;
*/
But it's a little ugly, any better idea?
Tan, Jianfeng
2015-11-24 06:19:07 UTC
Permalink
-----Original Message-----
Sent: Tuesday, November 24, 2015 11:53 AM
Qiu, Michael; Guohongzhen; Zhoujingbin; Zhangbo (Oscar); gaoxiaoqiu;
Zhbzg; Xie, Huawei
Subject: RE: [RFC 0/5] virtio support for container
-----Original Message-----
Sent: Friday, November 06, 2015 2:31 AM
Zhangbo
Subject: [RFC 0/5] virtio support for container
...
2.1.4
This patch raises a good idea: adding an extra abstracted I/O layer, which
would make it simple to extend the function to a kernel-mode switch (such
as OVS). That's great.
It's the issue of VHOST_USER_SET_MEM_TABLE. You allocate memory from
the tmpfs filesystem with just one fd, so rte_memseg_info_get() can be used to
directly get the memory topology. However, things change in kernel
space, because the mempool should be created on each container's
hugetlbfs (rather than tmpfs), which is separated from the others;
finally, consider the ioctl's parameters.
/*
reg = mem->regions;
reg->guest_phys_addr = (__u64) ((struct virtqueue *)(dev->data->rx_queues[0]))->mpool->elt_va_start;
reg->userspace_addr = reg->guest_phys_addr;
reg->memory_size = ((struct virtqueue *)(dev->data->rx_queues[0]))->mpool->elt_va_end - reg->guest_phys_addr;
reg = mem->regions + 1;
reg->guest_phys_addr = (__u64)(((struct virtqueue *)(dev->data->tx_queues[0]))->virtio_net_hdr_mem);
reg->userspace_addr = reg->guest_phys_addr;
reg->memory_size = vq_size * internals->vtnet_hdr_size;
*/
But it's a little ugly, any better idea?
Hi Yanying,

Your solution seems OK to me when used with kernel vhost-net, because the vhost
kthread just shares the same mm_struct with the virtio process. But it will not work
with vhost-user, which realizes memory sharing by putting an fd in sendmsg().
Worse, it will not work with userspace vhost_cuse (see
lib/librte_vhost/vhost_cuse/) either, because the current implementation assumes that
the VM's physical memory is backed by one huge file. Actually, what we need to do
is enhance userspace vhost_cuse so that it supports cross-file memory regions.
Below are some solutions to support hugetlbfs, FYI:

To support hugetlbfs, my previous idea was to use the -v option of "docker run"
to map hugetlbfs into the container's /dev/shm, so that we can create a "huge" shm file
on hugetlbfs. But this did not seem to be accepted by others.

You mentioned the situation that DPDK now creates a file for each hugepage.
Maybe we just need to share all these hugepages with vhost, as sketched below. To minimize the
memory translation effort, we need to require that as few pages as
possible are used. Can you accept this solution?
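Roughly what I have in mind (a sketch only; struct back_file and rte_eal_get_backfile_info() are the EAL additions proposed in the follow-up patchset in this thread, and the region layout follows vhost-user):

#include <stdint.h>
#include <stdlib.h>

struct vhost_user_mem_region {  /* per the vhost-user protocol */
	uint64_t guest_phys_addr;
	uint64_t memory_size;
	uint64_t userspace_addr;
	uint64_t mmap_offset;
};

/* sketch: one vhost memory region per shared hugepage file; fewer,
 * larger pages mean fewer regions and cheaper address translation */
static int fill_regions(struct vhost_user_mem_region *regions, int max)
{
	struct back_file *bf;
	int i, n;

	n = rte_eal_get_backfile_info(&bf);  /* caller must free bf */
	for (i = 0; i < n && i < max; i++) {
		regions[i].userspace_addr = (uint64_t)(uintptr_t)bf[i].addr;
		regions[i].guest_phys_addr = regions[i].userspace_addr;  /* CVA */
		regions[i].memory_size = bf[i].size;
		regions[i].mmap_offset = 0;
		/* the matching fd (open(bf[i].filepath)) goes out via SCM_RIGHTS */
	}
	free(bf);
	return n;
}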

Thanks,
Jianfeng
Jianfeng Tan
2016-01-10 11:42:58 UTC
Permalink
This patchset is to provide high performance networking interface (virtio)
for container-based DPDK applications. The way of starting DPDK apps in
containers with ownership of NIC devices exclusively is beyond the scope.
The basic idea here is to present a new virtual device (named eth_cvio),
which can be discovered and initialized in container-based DPDK apps using
rte_eal_init(). To minimize the change, we reuse already-existing virtio
frontend driver code (driver/net/virtio/).

Compared to QEMU/VM case, virtio device framework (translates I/O port r/w
operations into unix socket/cuse protocol, which is originally provided in
QEMU), is integrated in virtio frontend driver. So this converged driver
actually plays the role of original frontend driver and the role of QEMU
device framework.

The major difference lies in how to calculate relative address for vhost.
The principle of virtio is that: based on one or multiple shared memory
segments, vhost maintains a reference system with the base addresses and
length of each segment, so that an address coming from the VM (usually a GPA,
Guest Physical Address) can be translated into a vhost-recognizable address
(named VVA, Vhost Virtual Address). To decrease the overhead of address
translation, we should maintain as few segments as possible. In VM's case,
GPA is always locally continuous. In container's case, CVA (Container
Virtual Address) can be used. Specifically:
a. when set_base_addr, CVA address is used;
b. when preparing RX's descriptors, CVA address is used;
c. when transmitting packets, CVA is filled in TX's descriptors;
d. in TX and CQ's header, CVA is used.

How to share memory? In VM's case, qemu always shares all physical layout
to backend. But it's not feasible for a container, as a process, to share
all virtual memory regions to backend. So only specified virtual memory
regions (with type of shared) are sent to backend. It's a limitation that
only addresses in these areas can be used to transmit or receive packets.

Known issues

a. When used with vhost-net, root privilege is required to create tap
device inside.
b. Control queue and multi-queue are not supported yet.
c. When --single-file option is used, socket_id of the memory may be
wrong. (Use "numactl -N x -m x" to work around this for now)

How to use?

a. Apply this patchset.

b. To compile container apps:
$: make config RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
$: make install RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
$: make -C examples/l2fwd RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
$: make -C examples/vhost RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc

c. To build a docker image using Dockerfile below.
$: cat ./Dockerfile
FROM ubuntu:latest
WORKDIR /usr/src/dpdk
COPY . /usr/src/dpdk
ENV PATH "$PATH:/usr/src/dpdk/examples/l2fwd/build/"
$: docker build -t dpdk-app-l2fwd .

d. Used with vhost-user
$: ./examples/vhost/build/vhost-switch -c 3 -n 4 \
--socket-mem 1024,1024 -- -p 0x1 --stats 1
$: docker run -i -t -v <path_to_vhost_unix_socket>:/var/run/usvhost \
-v /dev/hugepages:/dev/hugepages \
dpdk-app-l2fwd l2fwd -c 0x4 -n 4 -m 1024 --no-pci \
--vdev=eth_cvio0,path=/var/run/usvhost -- -p 0x1

f. Used with vhost-net
$: modprobe vhost
$: modprobe vhost-net
$: docker run -i -t --privileged \
-v /dev/vhost-net:/dev/vhost-net \
-v /dev/net/tun:/dev/net/tun \
-v /dev/hugepages:/dev/hugepages \
dpdk-app-l2fwd l2fwd -c 0x4 -n 4 -m 1024 --no-pci \
--vdev=eth_cvio0,path=/dev/vhost-net -- -p 0x1

By the way, it's not necessary to run in a container.

Signed-off-by: Huawei Xie <***@intel.com>
Signed-off-by: Jianfeng Tan <***@intel.com>

Jianfeng Tan (4):
mem: add --single-file to create single mem-backed file
mem: add API to obstain memory-backed file info
virtio/vdev: add ways to interact with vhost
virtio/vdev: add a new vdev named eth_cvio

config/common_linuxapp | 5 +
drivers/net/virtio/Makefile | 4 +
drivers/net/virtio/vhost.c | 734 +++++++++++++++++++++++++++++
drivers/net/virtio/vhost.h | 192 ++++++++
drivers/net/virtio/virtio_ethdev.c | 338 ++++++++++---
drivers/net/virtio/virtio_ethdev.h | 4 +
drivers/net/virtio/virtio_pci.h | 52 +-
drivers/net/virtio/virtio_rxtx.c | 11 +-
drivers/net/virtio/virtio_rxtx_simple.c | 14 +-
drivers/net/virtio/virtqueue.h | 13 +-
lib/librte_eal/common/eal_common_options.c | 17 +
lib/librte_eal/common/eal_internal_cfg.h | 1 +
lib/librte_eal/common/eal_options.h | 2 +
lib/librte_eal/common/include/rte_memory.h | 16 +
lib/librte_eal/linuxapp/eal/eal_memory.c | 82 +++-
15 files changed, 1392 insertions(+), 93 deletions(-)
create mode 100644 drivers/net/virtio/vhost.c
create mode 100644 drivers/net/virtio/vhost.h
--
2.1.4
Jianfeng Tan
2016-01-10 11:42:59 UTC
Permalink
Originally, there are two cons in using hugepages: a. root
privilege is needed to touch /proc/self/pagemap, which is a prerequisite for
allocating physically contiguous memsegs; b. possibly too many
hugepage files are created, especially when used with 2M hugepages.

Virtual devices don't care about the physical contiguity
of allocated hugepages at all. The --single-file option
provides a way to allocate all hugepages in a single memory-backed
file.

Known issues:
a. The single-file option relies on the kernel to allocate NUMA-affinitive
memory.
b. Possible ABI break: originally, --no-huge uses anonymous memory
instead of a file-backed way to create memory.

Signed-off-by: Huawei Xie <***@intel.com>
Signed-off-by: Jianfeng Tan <***@intel.com>
---
lib/librte_eal/common/eal_common_options.c | 17 +++++++++++
lib/librte_eal/common/eal_internal_cfg.h | 1 +
lib/librte_eal/common/eal_options.h | 2 ++
lib/librte_eal/linuxapp/eal/eal_memory.c | 45 ++++++++++++++++++++++++++----
4 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_options.c b/lib/librte_eal/common/eal_common_options.c
index 29942ea..65bccbd 100644
--- a/lib/librte_eal/common/eal_common_options.c
+++ b/lib/librte_eal/common/eal_common_options.c
@@ -95,6 +95,7 @@ eal_long_options[] = {
{OPT_VFIO_INTR, 1, NULL, OPT_VFIO_INTR_NUM },
{OPT_VMWARE_TSC_MAP, 0, NULL, OPT_VMWARE_TSC_MAP_NUM },
{OPT_XEN_DOM0, 0, NULL, OPT_XEN_DOM0_NUM },
+ {OPT_SINGLE_FILE, 0, NULL, OPT_SINGLE_FILE_NUM },
{0, 0, NULL, 0 }
};

@@ -897,6 +898,10 @@ eal_parse_common_option(int opt, const char *optarg,
}
break;

+ case OPT_SINGLE_FILE_NUM:
+ conf->single_file = 1;
+ break;
+
/* don't know what to do, leave this to caller */
default:
return 1;
@@ -956,6 +961,16 @@ eal_check_common_options(struct internal_config *internal_cfg)
"be specified together with --"OPT_NO_HUGE"\n");
return -1;
}
+ if (internal_cfg->single_file && internal_cfg->force_sockets == 1) {
+ RTE_LOG(ERR, EAL, "Option --"OPT_SINGLE_FILE" cannot "
+ "be specified together with --"OPT_SOCKET_MEM"\n");
+ return -1;
+ }
+ if (internal_cfg->single_file && internal_cfg->hugepage_unlink) {
+ RTE_LOG(ERR, EAL, "Option --"OPT_HUGE_UNLINK" cannot "
+ "be specified together with --"OPT_SINGLE_FILE"\n");
+ return -1;
+ }

if (internal_cfg->no_hugetlbfs && internal_cfg->hugepage_unlink) {
RTE_LOG(ERR, EAL, "Option --"OPT_HUGE_UNLINK" cannot "
@@ -994,6 +1009,8 @@ eal_common_usage(void)
" -n CHANNELS Number of memory channels\n"
" -m MB Memory to allocate (see also --"OPT_SOCKET_MEM")\n"
" -r RANKS Force number of memory ranks (don't detect)\n"
+ " --"OPT_SINGLE_FILE" Create just single file for shared memory, and \n"
+ " do not promise physical contiguity of memseg\n"
" -b, --"OPT_PCI_BLACKLIST" Add a PCI device in black list.\n"
" Prevent EAL from using this PCI device. The argument\n"
" format is <domain:bus:devid.func>.\n"
diff --git a/lib/librte_eal/common/eal_internal_cfg.h b/lib/librte_eal/common/eal_internal_cfg.h
index 5f1367e..9117ed9 100644
--- a/lib/librte_eal/common/eal_internal_cfg.h
+++ b/lib/librte_eal/common/eal_internal_cfg.h
@@ -61,6 +61,7 @@ struct hugepage_info {
*/
struct internal_config {
volatile size_t memory; /**< amount of asked memory */
+ volatile unsigned single_file; /**< mmap all hugepages in single file */
volatile unsigned force_nchannel; /**< force number of channels */
volatile unsigned force_nrank; /**< force number of ranks */
volatile unsigned no_hugetlbfs; /**< true to disable hugetlbfs */
diff --git a/lib/librte_eal/common/eal_options.h b/lib/librte_eal/common/eal_options.h
index a881c62..e5da14a 100644
--- a/lib/librte_eal/common/eal_options.h
+++ b/lib/librte_eal/common/eal_options.h
@@ -83,6 +83,8 @@ enum {
OPT_VMWARE_TSC_MAP_NUM,
#define OPT_XEN_DOM0 "xen-dom0"
OPT_XEN_DOM0_NUM,
+#define OPT_SINGLE_FILE "single-file"
+ OPT_SINGLE_FILE_NUM,
OPT_LONG_MAX_NUM
};

diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index 846fd31..2bb1163 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -80,6 +80,10 @@
#include <errno.h>
#include <sys/ioctl.h>
#include <sys/time.h>
+#include <mntent.h>
+#include <sys/mman.h>
+#include <sys/file.h>
+#include <sys/vfs.h>

#include <rte_log.h>
#include <rte_memory.h>
@@ -92,6 +96,9 @@
#include <rte_common.h>
#include <rte_string_fns.h>

+#define _GNU_SOURCE
+#include <sys/syscall.h>
+
#include "eal_private.h"
#include "eal_internal_cfg.h"
#include "eal_filesystem.h"
@@ -768,6 +775,7 @@ create_shared_memory(const char *filename, const size_t mem_size)
}
retval = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
close(fd);
+
return retval;
}

@@ -1110,10 +1118,34 @@ rte_eal_hugepage_init(void)
/* get pointer to global configuration */
mcfg = rte_eal_get_configuration()->mem_config;

- /* hugetlbfs can be disabled */
- if (internal_config.no_hugetlbfs) {
+ /* when hugetlbfs is disabled or single-file option is specified */
+ if (internal_config.no_hugetlbfs || internal_config.single_file) {
+ int fd;
+ uint64_t pagesize;
+ unsigned socket_id;
+ char filepath[MAX_HUGEPAGE_PATH];
+
+ syscall(SYS_getcpu, NULL, &socket_id, NULL);
+
+ if (internal_config.no_hugetlbfs) {
+ eal_get_hugefile_path(filepath, sizeof(filepath),
+ "/dev/shm", 0);
+ pagesize = RTE_PGSIZE_4K;
+ } else {
+ struct hugepage_info *hpi = &internal_config.hugepage_info[0];
+ eal_get_hugefile_path(filepath, sizeof(filepath),
+ hpi->hugedir, 0);
+ pagesize = hpi->hugepage_sz;
+ }
+ fd = open(filepath, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
+ if (fd < 0) {
+ RTE_LOG(ERR, EAL, "%s: open %s failed: %s\n", __func__,
+ filepath, strerror(errno));
+ return -1;
+ }
+
addr = mmap(NULL, internal_config.memory, PROT_READ | PROT_WRITE,
- MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+ MAP_SHARED | MAP_POPULATE, fd, 0);
if (addr == MAP_FAILED) {
RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n", __func__,
strerror(errno));
@@ -1121,9 +1153,12 @@ rte_eal_hugepage_init(void)
}
mcfg->memseg[0].phys_addr = (phys_addr_t)(uintptr_t)addr;
mcfg->memseg[0].addr = addr;
- mcfg->memseg[0].hugepage_sz = RTE_PGSIZE_4K;
+ mcfg->memseg[0].hugepage_sz = pagesize;
mcfg->memseg[0].len = internal_config.memory;
- mcfg->memseg[0].socket_id = 0;
+ mcfg->memseg[0].socket_id = socket_id;
+
+ close(fd);
+
return 0;
}
--
2.1.4
Xie, Huawei
2016-01-21 01:57:45 UTC
Permalink
On 1/11/2016 2:43 AM, Tan, Jianfeng wrote:
[snip]
Post by Jianfeng Tan
+#include <mntent.h>
+#include <sys/mman.h>
+#include <sys/file.h>
+#include <sys/vfs.h>
Please remove unreferenced header files.
Post by Jianfeng Tan
#include <rte_log.h>
#include <rte_memory.h>
@@ -92,6 +96,9 @@
#include <rte_common.h>
#include <rte_string_fns.h>
+#define _GNU_SOURCE
+#include <sys/syscall.h>
+
#include "eal_private.h"
[snip]
Post by Jianfeng Tan
+ char filepath[MAX_HUGEPAGE_PATH];
+
+ syscall(SYS_getcpu, NULL, &socket_id, NULL);
+
[snip]
Post by Jianfeng Tan
mcfg->memseg[0].addr = addr;
- mcfg->memseg[0].hugepage_sz = RTE_PGSIZE_4K;
+ mcfg->memseg[0].hugepage_sz = pagesize;
mcfg->memseg[0].len = internal_config.memory;
- mcfg->memseg[0].socket_id = 0;
+ mcfg->memseg[0].socket_id = socket_id;
Anyway the socket_id here doesn't make sense. We could remove the
syscall which relies on _GNU_SOURCE.
Jianfeng Tan
2016-01-10 11:43:00 UTC
Permalink
A new API named rte_eal_get_backfile_info() and a new data
struct back_file are added to obtain information about
memory-backed files.
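
For illustration, a minimal caller of the new API could look like this (a sketch only; note the "caller to free" contract from the header comment):

#include <stdio.h>
#include <stdlib.h>
#include <rte_memory.h>

static void dump_backfiles(void)
{
	struct back_file *bf;
	int i, n;

	n = rte_eal_get_backfile_info(&bf);  /* allocates; caller frees */
	for (i = 0; i < n; i++)
		printf("%s: va=%p size=%zu\n",
		       bf[i].filepath, bf[i].addr, bf[i].size);
	free(bf);
}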

Signed-off-by: Huawei Xie <***@intel.com>
Signed-off-by: Jianfeng Tan <***@intel.com>
---
lib/librte_eal/common/include/rte_memory.h | 16 +++++++++++++
lib/librte_eal/linuxapp/eal/eal_memory.c | 37 ++++++++++++++++++++++++++++++
2 files changed, 53 insertions(+)

diff --git a/lib/librte_eal/common/include/rte_memory.h b/lib/librte_eal/common/include/rte_memory.h
index 9c9e40f..75ef8db 100644
--- a/lib/librte_eal/common/include/rte_memory.h
+++ b/lib/librte_eal/common/include/rte_memory.h
@@ -109,6 +109,22 @@ struct rte_memseg {
} __rte_packed;

/**
+ * This struct is used to store information about memory-backed file that
+ * we mapped in memory initialization.
+ */
+struct back_file {
+ void *addr; /**< virtual addr */
+ size_t size; /**< the page size */
+ char filepath[PATH_MAX]; /**< path to backing file on filesystem */
+};
+
+/**
+ * Get the hugepage file information. Caller to free.
+ * Return number of hugepage files used.
+ */
+int rte_eal_get_backfile_info(struct back_file **);
+
+/**
* Lock page in physical memory and prevent from swapping.
*
* @param virt
diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index 2bb1163..6ca1404 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -758,6 +758,9 @@ sort_by_physaddr(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi)
return 0;
}

+static struct hugepage_file *hugepage_files;
+static int num_hugepage_files;
+
/*
* Uses mmap to create a shared memory area for storage of data
* Used in this file to store the hugepage file map on disk
@@ -776,9 +779,29 @@ create_shared_memory(const char *filename, const size_t mem_size)
retval = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
close(fd);

+ hugepage_files = retval;
+ num_hugepage_files = mem_size / (sizeof(struct hugepage_file));
+
return retval;
}

+int
+rte_eal_get_backfile_info(struct back_file **p)
+{
+ struct back_file *backfiles;
+ int i, num_backfiles = num_hugepage_files;
+
+ backfiles = malloc(sizeof(struct back_file) * num_backfiles);
+ for (i = 0; i < num_backfiles; ++i) {
+ backfiles[i].addr = hugepage_files[i].final_va;
+ backfiles[i].size = hugepage_files[i].size;
+ strcpy(backfiles[i].filepath, hugepage_files[i].filepath);
+ }
+
+ *p = backfiles;
+ return num_backfiles;
+}
+
/*
* this copies *active* hugepages from one hugepage table to another.
* destination is typically the shared memory.
@@ -1157,6 +1180,20 @@ rte_eal_hugepage_init(void)
mcfg->memseg[0].len = internal_config.memory;
mcfg->memseg[0].socket_id = socket_id;

+ hugepage = create_shared_memory(eal_hugepage_info_path(),
+ sizeof(struct hugepage_file));
+ hugepage->orig_va = addr;
+ hugepage->final_va = addr;
+ hugepage->physaddr = rte_mem_virt2phy(addr);
+ hugepage->size = pagesize;
+ hugepage->socket_id = socket_id;
+ hugepage->file_id = 0;
+ hugepage->memseg_id = 0;
+#ifdef RTE_EAL_SINGLE_FILE_SEGMENTS
+ hugepage->repeated = internal_config.memory / pagesize;
+#endif
+ strncpy(hugepage->filepath, filepath, MAX_HUGEPAGE_PATH);
+
close(fd);

return 0;
--
2.1.4
Pavel Fedin
2016-01-11 11:43:54 UTC
Permalink
Hello!
-----Original Message-----
Sent: Sunday, January 10, 2016 2:43 PM
Subject: [PATCH 2/4] mem: add API to obstain memory-backed file info
"obtain" - typo in subject
A new API named rte_eal_get_backfile_info() and a new data
struct back_file are added to obtain information about memory-
backed files.
---
lib/librte_eal/common/include/rte_memory.h | 16 +++++++++++++
lib/librte_eal/linuxapp/eal/eal_memory.c | 37 ++++++++++++++++++++++++++++++
2 files changed, 53 insertions(+)
diff --git a/lib/librte_eal/common/include/rte_memory.h
b/lib/librte_eal/common/include/rte_memory.h
index 9c9e40f..75ef8db 100644
--- a/lib/librte_eal/common/include/rte_memory.h
+++ b/lib/librte_eal/common/include/rte_memory.h
@@ -109,6 +109,22 @@ struct rte_memseg {
} __rte_packed;
/**
+ * This struct is used to store information about memory-backed file that
+ * we mapped in memory initialization.
+ */
+struct back_file {
+ void *addr; /**< virtual addr */
+ size_t size; /**< the page size */
+ char filepath[PATH_MAX]; /**< path to backing file on filesystem */
+};
+
+/**
+ * Get the hugepage file information. Caller to free.
+ * Return number of hugepage files used.
+ */
+int rte_eal_get_backfile_info(struct back_file **);
+
+/**
* Lock page in physical memory and prevent from swapping.
*
diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c
b/lib/librte_eal/linuxapp/eal/eal_memory.c
index 2bb1163..6ca1404 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -758,6 +758,9 @@ sort_by_physaddr(struct hugepage_file *hugepg_tbl, struct hugepage_info
*hpi)
return 0;
}
+static struct hugepage_file *hugepage_files;
+static int num_hugepage_files;
+
/*
* Uses mmap to create a shared memory area for storage of data
* Used in this file to store the hugepage file map on disk
@@ -776,9 +779,29 @@ create_shared_memory(const char *filename, const size_t mem_size)
retval = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
close(fd);
+ hugepage_files = retval;
+ num_hugepage_files = mem_size / (sizeof(struct hugepage_file));
+
return retval;
}
+int
+rte_eal_get_backfile_info(struct back_file **p)
+{
+ struct back_file *backfiles;
+ int i, num_backfiles = num_hugepage_files;
+
+ backfiles = malloc(sizeof(struct back_file) * num_backfiles);
+ for (i = 0; i < num_backfiles; ++i) {
+ backfiles[i].addr = hugepage_files[i].final_va;
+ backfiles[i].size = hugepage_files[i].size;
+ strcpy(backfiles[i].filepath, hugepage_files[i].filepath);
+ }
+
+ *p = backfiles;
+ return num_backfiles;
+}
+
/*
* this copies *active* hugepages from one hugepage table to another.
* destination is typically the shared memory.
@@ -1157,6 +1180,20 @@ rte_eal_hugepage_init(void)
mcfg->memseg[0].len = internal_config.memory;
mcfg->memseg[0].socket_id = socket_id;
+ hugepage = create_shared_memory(eal_hugepage_info_path(),
+ sizeof(struct hugepage_file));
+ hugepage->orig_va = addr;
+ hugepage->final_va = addr;
+ hugepage->physaddr = rte_mem_virt2phy(addr);
+ hugepage->size = pagesize;
+ hugepage->socket_id = socket_id;
+ hugepage->file_id = 0;
+ hugepage->memseg_id = 0;
+#ifdef RTE_EAL_SINGLE_FILE_SEGMENTS
+ hugepage->repeated = internal_config.memory / pagesize;
+#endif
+ strncpy(hugepage->filepath, filepath, MAX_HUGEPAGE_PATH);
+
close(fd);
return 0;
--
2.1.4
Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Rich Lane
2016-01-11 20:26:12 UTC
Permalink
Post by Jianfeng Tan
@@ -1157,6 +1180,20 @@ rte_eal_hugepage_init(void)
mcfg->memseg[0].len = internal_config.memory;
mcfg->memseg[0].socket_id = socket_id;
+ hugepage = create_shared_memory(eal_hugepage_info_path(),
+ sizeof(struct hugepage_file));
+ hugepage->orig_va = addr;
+ hugepage->final_va = addr;
+ hugepage->physaddr = rte_mem_virt2phy(addr);
+ hugepage->size = pagesize;
Should this be "hugepage->size = internal_config.memory"? Otherwise the
vhost-user
memtable entry has a size of only 2MB.
Tan, Jianfeng
2016-01-12 09:12:13 UTC
Permalink
Hi!
Post by Jianfeng Tan
@@ -1157,6 +1180,20 @@ rte_eal_hugepage_init(void)
mcfg->memseg[0].len = internal_config.memory;
mcfg->memseg[0].socket_id = socket_id;
+ hugepage =
create_shared_memory(eal_hugepage_info_path(),
+ sizeof(struct hugepage_file));
+ hugepage->orig_va = addr;
+ hugepage->final_va = addr;
+ hugepage->physaddr = rte_mem_virt2phy(addr);
+ hugepage->size = pagesize;
Should this be "hugepage->size = internal_config.memory"? Otherwise
the vhost-user
memtable entry has a size of only 2MB.
I don't think so. See the definition:

47 struct hugepage_file {
48 void *orig_va; /**< virtual addr of first mmap() */
49 void *final_va; /**< virtual addr of 2nd mmap() */
50 uint64_t physaddr; /**< physical addr */
51 size_t size; /**< the page size */
52 int socket_id; /**< NUMA socket ID */
53 int file_id; /**< the '%d' in HUGEFILE_FMT */
54 int memseg_id; /**< the memory segment to which page
belongs */
55 #ifdef RTE_EAL_SINGLE_FILE_SEGMENTS
56 int repeated; /**< number of times the page size
is repeated */
57 #endif
58 char filepath[MAX_HUGEPAGE_PATH]; /**< path to backing file
on filesystem */
59 };

size stands for the page size instead of total size.

Thanks,
Jianfeng
Pavel Fedin
2016-01-12 10:04:04 UTC
Permalink
Hello!
Post by Tan, Jianfeng
Should this be "hugepage->size = internal_config.memory"? Otherwise the vhost-user
memtable entry has a size of only 2MB.
47 struct hugepage_file {
48 void *orig_va; /**< virtual addr of first mmap() */
49 void *final_va; /**< virtual addr of 2nd mmap() */
50 uint64_t physaddr; /**< physical addr */
51 size_t size; /**< the page size */
52 int socket_id; /**< NUMA socket ID */
53 int file_id; /**< the '%d' in HUGEFILE_FMT */
54 int memseg_id; /**< the memory segment to which page belongs */
55 #ifdef RTE_EAL_SINGLE_FILE_SEGMENTS
56 int repeated; /**< number of times the page size is repeated */
57 #endif
58 char filepath[MAX_HUGEPAGE_PATH]; /**< path to backing file on filesystem */
59 };
size stands for the page size instead of total size.
But in this case the host gets this page size as the total region size, therefore qva_to_vva() fails.
I haven't worked with hugepages, but I guess that with real hugepages we get one file per page, therefore page size == mapping size. With the newly introduced --single-file we now have something that pretends to be a single "uber-huge-page", so we need to specify the total size of the mapping here.

BTW, I'm still unhappy about the ABI breakage here. I think we could easily add a --shared-mem option, which would simply change the mapping mode to SHARED. So, we could use it with both hugepages (default) and plain mmap (with --no-hugepages).

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Tan, Jianfeng
2016-01-12 10:48:09 UTC
Permalink
Hello!
Post by Pavel Fedin
But in this case the host gets this page size as the total region size, therefore qva_to_vva() fails.
I haven't worked with hugepages, but I guess that with real hugepages we get one file per page, therefore page size == mapping size. With the newly introduced --single-file we now have something that pretends to be a single "uber-huge-page", so we need to specify the total size of the mapping here.
Oh, I get it and recognize the problem here. The actual problem lies in
the API rte_eal_get_backfile_info():
backfiles[i].size = hugepage_files[i].size;
It should use statfs() or hugepage_files[i].size * hugepage_files[i].repeated
to calculate the total size.
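I.e. something like this (a sketch only; assumes .repeated is available outside the #ifdef):

/* sketch of the fix: report the whole mapping, not one page */
backfiles[i].size = hugepage_files[i].size * hugepage_files[i].repeated;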
Post by Pavel Fedin
BTW, I'm still unhappy about the ABI breakage here. I think we could easily add a --shared-mem option, which would simply change the mapping mode to SHARED. So, we could use it with both hugepages (default) and plain mmap (with --no-hugepages).
You mean, use "--no-hugepages --shared-mem" together, right?
That makes sense to me.

Thanks,
Jianfeng
Post by Pavel Fedin
Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Pavel Fedin
2016-01-12 11:00:42 UTC
Permalink
Hello!
Post by Pavel Fedin
Post by Pavel Fedin
BTW, i'm still unhappy about ABI breakage here. I think we could easily add --shared-mem
option, which would simply change mapping mode to SHARED. So, we could use it with both
hugepages (default) and plain mmap (with --no-hugepages).
You mean, use "--no-hugepages --shared-mem" together, right?
Yes. This would be perfectly backwards-compatible.

Kind regards,
Pavel Fedin
Senior Engineer
Samsung Electronics Research center Russia
Sergio Gonzalez Monroy
2016-01-12 11:07:40 UTC
Permalink
Hi Pavel,
Post by Pavel Fedin
Hello!
Post by Pavel Fedin
Post by Pavel Fedin
BTW, i'm still unhappy about ABI breakage here. I think we could easily add --shared-mem
option, which would simply change mapping mode to SHARED. So, we could use it with both
hugepages (default) and plain mmap (with --no-hugepages).
You mean, use "--no-hugepages --shared-mem" together, right?
Yes. This would be perfectly backwards-compatible.
So are you suggesting not to introduce the --single-file option but instead
--shared-mem?
AFAIK --single-file was trying to work around the limitation of just
being able to map 8 fds.

Sergio
Post by Pavel Fedin
Kind regards,
Pavel Fedin
Senior Engineer
Samsung Electronics Research center Russia
Pavel Fedin
2016-01-12 11:37:54 UTC
Permalink
Hello!
Post by Sergio Gonzalez Monroy
So are you suggesting not to introduce the --single-file option but instead
--shared-mem?
AFAIK --single-file was trying to work around the limitation of just
being able to map 8 fds.
Heh, yes, you're right... Indeed, sorry, I was not patient enough; I see it uses hpi->hugedir instead of /dev/shm... I was confused by the code path... It seemed that --single-file was an alias for --no-hugepages.
And the patch still changes the mmap() mode to SHARED unconditionally, which is not good in terms of backwards compatibility (and this is explicitly noted in the cover letter).

So, let's try to sort it out...
a) By default we should still have MAP_PRIVATE.
b) Let's say that we need --shared-mem in order to make it MAP_SHARED. This can be combined with --no-hugepages if necessary (this is what I tried to implement based on the old RFC).
c) Let's say that --single-file uses hugetlbfs but maps everything via a single file. This still can be combined with --shared-mem.

Wouldn't this be more clear, more straightforward and implication-free?

And if we agree on that, we could now try to decrease the number of options:
a) We could imply MAP_SHARED if cvio is used, because shared memory is mandatory in this case.
b) (c) above again raises a question: doesn't it make CONFIG_RTE_EAL_SINGLE_FILE_SEGMENTS obsolete? Or maybe we could use that one instead of --single-file (however, I'm not a fan of compile-time configuration like this)?

Kind regards,
Pavel Fedin
Senior Engineer
Samsung Electronics Research center Russia
Sergio Gonzalez Monroy
2016-01-12 12:12:20 UTC
Permalink
Post by Pavel Fedin
Hello!
Post by Sergio Gonzalez Monroy
So are you suggesting to not introduce --single-file option but instead
--shared-mem?
AFAIK --single-file was trying to workaround the limitation of just
being able to map 8 fds.
Heh, yes, you're right... Indeed, sorry, I was not patient enough; I see it uses hpi->hugedir instead of /dev/shm... I was confused by the code path... It seemed that --single-file was an alias for --no-hugepages.
And the patch still changes the mmap() mode to SHARED unconditionally, which is not good in terms of backwards compatibility (and this is explicitly noted in the cover letter).
I might be missing something obvious here but, aside from having memory
SHARED, which most DPDK apps using hugepages will have anyway, what are
the backward compatibility issues that you see here?
Post by Pavel Fedin
So, let's try to sort out...
a) By default we should still have MAP_PRIVATE
b) Let's say that we need --shared-mem in order to make it MAP_SHARED. This can be combined with --no-hugepages if necessary (this is what i tried to implement based on the old RFC).
--shared-mem would only have meaning with --no-huge, right?
Post by Pavel Fedin
c) Let's say that --single-file uses hugetlbfs but maps everything via single file. This still can be combined with --shared-mem.
By default, when using hugepages, all mappings are SHARED for the
multiprocess model.
IMHO, if you really want the ability to have private memory
instead, because you are not considering that model, then it might be
more appropriate to have a --private-mem or --no-shared-mem option instead.

Sergio
Post by Pavel Fedin
wouldn't this be more clear, more straightforward and implication-free?
a) We could imply MAP_SHARED if cvio is used, because shared memory is mandatory in this case.
b) (c) above again raises a question: doesn't it make CONFIG_RTE_EAL_SINGLE_FILE_SEGMENTS obsolete? Or maybe we could use that one instead of --single-file (however, I'm not a fan of compile-time configuration like this)?
Kind regards,
Pavel Fedin
Senior Engineer
Samsung Electronics Research center Russia
Pavel Fedin
2016-01-12 13:57:19 UTC
Permalink
Hello!
Post by Sergio Gonzalez Monroy
I might be missing something obvious here but, aside from having memory
SHARED which most DPDK apps using hugepages will have anyway, what is
the backward compatibility issues that you see here?
Heh, sorry once again for the confusion. Indeed, with hugepages we always get MAP_SHARED. I missed that. So, we indeed need
--shared-mem only in addition to --no-huge.

The backwards compatibility issue is stated in the description of PATCH 1/4:
--- cut ---
b. possible ABI break, originally, --no-huge uses anonymous memory
instead of file-backed way to create memory.
--- cut ---
The patch unconditionally changes that to SHARED. That's all.

Kind regards,
Pavel Fedin
Senior Engineer
Samsung Electronics Research center Russia
Sergio Gonzalez Monroy
2016-01-12 14:13:21 UTC
Permalink
Post by Pavel Fedin
Hello!
Post by Sergio Gonzalez Monroy
I might be missing something obvious here but, aside from having memory
SHARED which most DPDK apps using hugepages will have anyway, what is
the backward compatibility issues that you see here?
Heh, sorry once again for the confusion. Indeed, with hugepages we always get MAP_SHARED. I missed that. So, we indeed need
--shared-mem only in addition to --no-huge.
--- cut ---
b. possible ABI break, originally, --no-huge uses anonymous memory
instead of file-backed way to create memory.
--- cut ---
The patch unconditionally changes that to SHARED. That's all.
I should have read more carefully!
Sorry about that; I thought you were the one with the ABI concerns.

Regarding the ABI, I don't think there is any ABI issue with the change: we
just have our memory file-backed and SHARED, but we already do that when using
hugepages, so I don't think it would be a huge issue.
But if folks have concerns about it, we could always keep the old behavior
by default and, as you suggest, introduce another option for changing
the flag.

Sergio
Post by Pavel Fedin
Kind regards,
Pavel Fedin
Senior Engineer
Samsung Electronics Research center Russia
Sergio Gonzalez Monroy
2016-01-12 11:44:43 UTC
Permalink
Post by Sergio Gonzalez Monroy
Hi Pavel,
Post by Pavel Fedin
Hello!
Post by Pavel Fedin
Post by Pavel Fedin
BTW, I'm still unhappy about ABI breakage here. I think we could
easily add --shared-mem
Could you elaborate a bit more on your concerns regarding ABI breakage?
Post by Sergio Gonzalez Monroy
Post by Pavel Fedin
Post by Pavel Fedin
option, which would simply change mapping mode to SHARED. So, we could use it with both
hugepages (default) and plain mmap (with --no-hugepages).
You mean, use "--no-hugepages --shared-mem" together, right?
Yes. This would be perfectly backwards-compatible.
So are you suggesting to not introduce --single-file option but
instead --shared-mem?
AFAIK --single-file was trying to workaround the limitation of just
being able to map 8 fds.
My bad, I misread the posts.
Jianfeng pointed out that you are suggesting --shared-mem in order to
have the same functionality
with or without hugepages.

Sergio
Post by Sergio Gonzalez Monroy
Sergio
Post by Pavel Fedin
Kind regards,
Pavel Fedin
Senior Engineer
Samsung Electronics Research center Russia
Pavel Fedin
2016-01-12 11:22:03 UTC
Permalink
Hello!
Post by Tan, Jianfeng
Oh I get it and recognize the problem here. The actual problem lies in
the API rte_eal_get_backfile_info().
backfiles[i].size = hugepage_files[i].size;
Should use statfs or hugepage_files[i].size * hugepage_files[i].repeated
to calculate the total size.
.repeated depends on CONFIG_RTE_EAL_SINGLE_FILE_SEGMENTS. By the way, it looks like it does the same thing as you are trying to do with --single-file, but with hugepages, doesn't it? I see it's currently used by ivshmem (which is AFAIK very immature and half-abandoned).
Or should we just move .repeated out of the #ifdef?
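
As a concrete sketch of the size fix being proposed here (the structure below is illustrative only, not the real EAL type):

#include <stddef.h>
#include <stdint.h>

/* Illustrative shape of the per-backing-file accounting discussed above. */
struct backfile_info {
	void *addr;        /* mapped virtual address */
	size_t size;       /* size of one mapping */
	uint32_t repeated; /* mappings sharing this file; only maintained
	                    * when SINGLE_FILE_SEGMENTS is enabled */
};

/* Total bytes backed by the file, which is what vhost needs to know. */
static size_t backfile_total_size(const struct backfile_info *f)
{
#ifdef RTE_EAL_SINGLE_FILE_SEGMENTS
	return f->size * (size_t)f->repeated; /* whole file, all segments */
#else
	return f->size; /* one mapping per file */
#endif
}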

Kind regards,
Pavel Fedin
Senior Engineer
Samsung Electronics Research center Russia
Sergio Gonzalez Monroy
2016-01-12 11:33:03 UTC
Permalink
Post by Pavel Fedin
Hello!
Post by Tan, Jianfeng
Oh I get it and recognize the problem here. The actual problem lies in
the API rte_eal_get_backfile_info().
backfiles[i].size = hugepage_files[i].size;
Should use statfs or hugepage_files[i].size * hugepage_files[i].repeated
to calculate the total size.
.repeated depends on CONFIG_RTE_EAL_SINGLE_FILE_SEGMENTS. By the way, it looks like it does the same thing as you are trying to do with --single-file, but with hugepages, doesn't it? I see it's currently used by ivshmem (which is AFAIK very immature and half-abandoned).
Similar but not the same.
--single-file: a single file for all mapped hugepages.
SINGLE_FILE_SEGMENTS: a file per set of physically contiguous mapped
hugepages (what DPDK calls memseg , memory segment). So there could be
more than one file.
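
A rough sketch of the difference, with hypothetical file names (the real EAL mapping code is considerably more involved):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define SEG_SIZE (2UL * 1024 * 1024) /* illustrative 2 MB segment */

/* --single-file: one fd; every memseg is an offset into the same file. */
static int map_single_file(void *va[], int nsegs)
{
	int i, fd = open("/dev/hugepages/rtemap_all", O_CREAT | O_RDWR, 0600);

	if (fd < 0 || ftruncate(fd, (off_t)nsegs * SEG_SIZE) < 0)
		return -1;
	for (i = 0; i < nsegs; i++) {
		va[i] = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE,
			     MAP_SHARED, fd, (off_t)i * SEG_SIZE);
		if (va[i] == MAP_FAILED)
			return -1;
	}
	return fd; /* a single fd to pass to the backend */
}

/* SINGLE_FILE_SEGMENTS: one fd per physically contiguous memseg, so up
 * to nsegs files (and fds) in total. */
static int map_file_per_seg(void *va[], int fds[], int nsegs)
{
	int i;
	char path[64];

	for (i = 0; i < nsegs; i++) {
		snprintf(path, sizeof(path), "/dev/hugepages/rtemap_%d", i);
		fds[i] = open(path, O_CREAT | O_RDWR, 0600);
		if (fds[i] < 0 || ftruncate(fds[i], SEG_SIZE) < 0)
			return -1;
		va[i] = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE,
			     MAP_SHARED, fds[i], 0);
		if (va[i] == MAP_FAILED)
			return -1;
	}
	return 0;
}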

Sergio
Post by Pavel Fedin
Or should we just move .repeated out of the #ifdef?
Kind regards,
Pavel Fedin
Senior Engineer
Samsung Electronics Research center Russia
Pavel Fedin
2016-01-12 12:01:30 UTC
Permalink
Hello!
Post by Pavel Fedin
Post by Pavel Fedin
.repeated depends on CONFIG_RTE_EAL_SINGLE_FILE_SEGMENTS. By the way, it looks like it does
the same thing as you are trying to do with --single-file, but with hugepages, doesn't it? I
see it's currently used by ivshmem (which is AFAIK very immature and half-abandoned).
Similar but not the same.
--single-file: a single file for all mapped hugepages.
SINGLE_FILE_SEGMENTS: a file per set of physically contiguous mapped
hugepages (what DPDK calls memseg , memory segment). So there could be
more than one file.
Thank you for the explanation.

By this time, I've done more testing: the current patchset breaks --no-huge. I did not study why:
--- cut ---
Program received signal SIGBUS, Bus error.
malloc_elem_init (elem=elem@entry=0x7fffe51e6000, heap=0x7ffff7fe5a1c, ms=ms@entry=0x7ffff7fb301c, size=size@entry=268435392) at /home/p.fedin/dpdk/lib/librte_eal/common/malloc_elem.c:62
62 /home/p.fedin/dpdk/lib/librte_eal/common/malloc_elem.c: No such file or directory.
Missing separate debuginfos, use: dnf debuginfo-install keyutils-libs-1.5.9-7.fc23.x86_64 krb5-libs-1.13.2-11.fc23.x86_64 libcap-ng-0.7.7-2.fc23.x86_64 libcom_err-1.42.13-3.fc23.x86_64 libselinux-2.4-4.fc23.x86_64 openssl-libs-1.0.2d-2.fc23.x86_64 pcre-8.37-4.fc23.x86_64 zlib-1.2.8-9.fc23.x86_64
(gdb) where
#0 malloc_elem_init (elem=elem@entry=0x7fffe51e6000, heap=0x7ffff7fe5a1c, ms=ms@entry=0x7ffff7fb301c, size=size@entry=268435392)
at /home/p.fedin/dpdk/lib/librte_eal/common/malloc_elem.c:62
#1 0x00000000004a50b5 in malloc_heap_add_memseg (ms=0x7ffff7fb301c, heap=<optimized out>) at /home/p.fedin/dpdk/lib/librte_eal/common/malloc_heap.c:109
#2 rte_eal_malloc_heap_init () at /home/p.fedin/dpdk/lib/librte_eal/common/malloc_heap.c:232
#3 0x00000000004be896 in rte_eal_memzone_init () at /home/p.fedin/dpdk/lib/librte_eal/common/eal_common_memzone.c:427
#4 0x000000000042ab02 in rte_eal_init (argc=argc@entry=11, argv=argv@entry=0x7fffffffeb80) at /home/p.fedin/dpdk/lib/librte_eal/linuxapp/eal/eal.c:799
#5 0x000000000066dfb9 in dpdk_init (argc=11, argv=0x7fffffffeb80) at lib/netdev-dpdk.c:2192
#6 0x000000000040ddd9 in main (argc=12, argv=0x7fffffffeb78) at vswitchd/ovs-vswitchd.c:74
--- cut ---

And now I tend to think that we do not need --single-file at all, because:
a) It's just a temporary workaround for the "more than 8 regions" problem.
b) It's not compatible with physical hardware anyway.

So I think we could easily use the "--no-huge --shared-mem" combination. We could address the hugepages compatibility problem later.

Kind regards,
Pavel Fedin
Senior Engineer
Samsung Electronics Research center Russia
Sergio Gonzalez Monroy
2016-01-12 13:39:12 UTC
Permalink
Post by Pavel Fedin
Hello!
Post by Pavel Fedin
Post by Pavel Fedin
.repeated depends on CONFIG_RTE_EAL_SINGLE_FILE_SEGMENTS. By the way, it looks like it does
the same thing as you are trying to do with --single-file, but with hugepages, doesn't it? I
see it's currently used by ivshmem (which is AFAIK very immature and half-abandoned).
Similar but not the same.
--single-file: a single file for all mapped hugepages.
SINGLE_FILE_SEGMENTS: a file per set of physically contiguous mapped
hugepages (what DPDK calls memseg , memory segment). So there could be
more than one file.
Thank you for the explanation.
--- cut ---
Program received signal SIGBUS, Bus error.
malloc_elem_init (elem=elem@entry=0x7fffe51e6000, heap=0x7ffff7fe5a1c, ms=ms@entry=0x7ffff7fb301c, size=size@entry=268435392) at /home/p.fedin/dpdk/lib/librte_eal/common/malloc_elem.c:62
62 /home/p.fedin/dpdk/lib/librte_eal/common/malloc_elem.c: No such file or directory.
Missing separate debuginfos, use: dnf debuginfo-install keyutils-libs-1.5.9-7.fc23.x86_64 krb5-libs-1.13.2-11.fc23.x86_64 libcap-ng-0.7.7-2.fc23.x86_64 libcom_err-1.42.13-3.fc23.x86_64 libselinux-2.4-4.fc23.x86_64 openssl-libs-1.0.2d-2.fc23.x86_64 pcre-8.37-4.fc23.x86_64 zlib-1.2.8-9.fc23.x86_64
(gdb) where
#0 malloc_elem_init (elem=elem@entry=0x7fffe51e6000, heap=0x7ffff7fe5a1c, ms=ms@entry=0x7ffff7fb301c, size=size@entry=268435392)
at /home/p.fedin/dpdk/lib/librte_eal/common/malloc_elem.c:62
#1 0x00000000004a50b5 in malloc_heap_add_memseg (ms=0x7ffff7fb301c, heap=<optimized out>) at /home/p.fedin/dpdk/lib/librte_eal/common/malloc_heap.c:109
#2 rte_eal_malloc_heap_init () at /home/p.fedin/dpdk/lib/librte_eal/common/malloc_heap.c:232
#3 0x00000000004be896 in rte_eal_memzone_init () at /home/p.fedin/dpdk/lib/librte_eal/common/eal_common_memzone.c:427
#4 0x000000000042ab02 in rte_eal_init (argc=argc@entry=11, argv=argv@entry=0x7fffffffeb80) at /home/p.fedin/dpdk/lib/librte_eal/linuxapp/eal/eal.c:799
#5 0x000000000066dfb9 in dpdk_init (argc=11, argv=0x7fffffffeb80) at lib/netdev-dpdk.c:2192
#6 0x000000000040ddd9 in main (argc=12, argv=0x7fffffffeb78) at vswitchd/ovs-vswitchd.c:74
--- cut ---
a) It's just a temporary workaround for the "more than 8 regions" problem.
b) It's not compatible with physical hardware anyway.
That's a good summary.
I think --single-file was mostly solving vhost's limit of mapping only
8 fds. We end up with a single memseg, as we do with --no-huge, except
that it is made of hugepages (and, in this patch, mapped SHARED instead
of PRIVATE).
Also, it would be compatible with physical hardware when using IOMMU and VFIO.
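
For reference, the 8-fd ceiling comes from the vhost-user memory table; a small sketch of the constraint, reusing the constant that appears in vhost.h in the patch below:

#include <stdio.h>

#define VHOST_MEMORY_MAX_NREGIONS 8 /* as defined in the patch's vhost.h */

/* Each shared-memory backing file consumes one region (and one fd) in
 * VHOST_USER_SET_MEM_TABLE, so the file count is capped by the region
 * limit. */
static int check_region_count(int num_backing_files)
{
	if (num_backing_files > VHOST_MEMORY_MAX_NREGIONS) {
		fprintf(stderr, "%d backing files exceed the vhost-user limit of %d regions\n",
			num_backing_files, VHOST_MEMORY_MAX_NREGIONS);
		return -1;
	}
	return 0;
}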

Sergio
Post by Pavel Fedin
So I think we could easily use the "--no-huge --shared-mem" combination. We could address the hugepages compatibility problem later.
Kind regards,
Pavel Fedin
Senior Engineer
Samsung Electronics Research center Russia
Jianfeng Tan
2016-01-10 11:43:01 UTC
Permalink
Depends on the type of vhost file: vhost-user is used if the given
path points to a unix socket; vhost-net is used if the given path
points to a char device.

NOTE: we now keep CONFIG_RTE_VIRTIO_VDEV undefined by default; it needs
to be uncommented when in use.

Signed-off-by: Huawei Xie <***@intel.com>
Signed-off-by: Jianfeng Tan <***@intel.com>
---
config/common_linuxapp | 5 +
drivers/net/virtio/Makefile | 4 +
drivers/net/virtio/vhost.c | 734 +++++++++++++++++++++++++++++++++++++
drivers/net/virtio/vhost.h | 192 ++++++++++
drivers/net/virtio/virtio_ethdev.h | 5 +-
drivers/net/virtio/virtio_pci.h | 52 ++-
6 files changed, 990 insertions(+), 2 deletions(-)
create mode 100644 drivers/net/virtio/vhost.c
create mode 100644 drivers/net/virtio/vhost.h

diff --git a/config/common_linuxapp b/config/common_linuxapp
index 74bc515..f76e162 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -534,3 +534,8 @@ CONFIG_RTE_APP_TEST=y
CONFIG_RTE_TEST_PMD=y
CONFIG_RTE_TEST_PMD_RECORD_CORE_CYCLES=n
CONFIG_RTE_TEST_PMD_RECORD_BURST_STATS=n
+
+#
+# Enable virtio support for container
+#
+CONFIG_RTE_VIRTIO_VDEV=y
diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index 43835ba..0877023 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -52,6 +52,10 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c

+ifeq ($(CONFIG_RTE_VIRTIO_VDEV),y)
+ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += vhost.c
+endif
+
# this lib depends upon:
DEPDIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += lib/librte_eal lib/librte_ether
DEPDIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += lib/librte_mempool lib/librte_mbuf
diff --git a/drivers/net/virtio/vhost.c b/drivers/net/virtio/vhost.c
new file mode 100644
index 0000000..e423e02
--- /dev/null
+++ b/drivers/net/virtio/vhost.c
@@ -0,0 +1,734 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdint.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <stdio.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <string.h>
+#include <errno.h>
+#include <assert.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <sys/eventfd.h>
+#include <sys/ioctl.h>
+#include <net/if.h>
+
+#include <rte_mbuf.h>
+#include <rte_memory.h>
+#include <rte_eal_memconfig.h>
+
+#include "virtio_pci.h"
+#include "virtio_logs.h"
+#include "virtio_ethdev.h"
+#include "virtqueue.h"
+#include "vhost.h"
+
+static int
+vhost_user_write(int fd, void *buf, int len, int *fds, int fd_num)
+{
+ struct msghdr msgh;
+ struct iovec iov;
+ int r;
+
+ size_t fd_size = fd_num * sizeof(int);
+ char control[CMSG_SPACE(fd_size)];
+ struct cmsghdr *cmsg;
+
+ memset(&msgh, 0, sizeof(msgh));
+ memset(control, 0, sizeof(control));
+
+ iov.iov_base = (uint8_t *)buf;
+ iov.iov_len = len;
+
+ msgh.msg_iov = &iov;
+ msgh.msg_iovlen = 1;
+
+ msgh.msg_control = control;
+ msgh.msg_controllen = sizeof(control);
+
+ cmsg = CMSG_FIRSTHDR(&msgh);
+
+ cmsg->cmsg_len = CMSG_LEN(fd_size);
+ cmsg->cmsg_level = SOL_SOCKET;
+ cmsg->cmsg_type = SCM_RIGHTS;
+ memcpy(CMSG_DATA(cmsg), fds, fd_size);
+
+ do {
+ r = sendmsg(fd, &msgh, 0);
+ } while (r < 0 && errno == EINTR);
+
+ return r;
+}
+
+static int
+vhost_user_read(int fd, VhostUserMsg *msg)
+{
+ uint32_t valid_flags = VHOST_USER_REPLY_MASK | VHOST_USER_VERSION;
+ int ret, sz_hdr = VHOST_USER_HDR_SIZE, sz_payload;
+
+ ret = recv(fd, (void *)msg, sz_hdr, 0);
+ if (ret < sz_hdr) {
+ PMD_DRV_LOG(ERR, "Failed to recv msg hdr: %d instead of %d.",
+ ret, sz_hdr);
+ goto fail;
+ }
+
+ /* validate msg flags */
+ if (msg->flags != (valid_flags)) {
+ PMD_DRV_LOG(ERR, "Failed to recv msg: flags 0x%x instead of 0x%x.",
+ msg->flags, valid_flags);
+ goto fail;
+ }
+
+ sz_payload = msg->size;
+ if (sz_payload) {
+ ret = recv(fd, (void *)((uint8_t *)msg + sz_hdr), sz_payload, 0);
+ if (ret < sz_payload) {
+ PMD_DRV_LOG(ERR, "Failed to recv msg payload: %d instead of %d.",
+ ret, msg->size);
+ goto fail;
+ }
+ }
+
+ return 0;
+
+fail:
+ return -1;
+}
+
+static VhostUserMsg m __attribute__ ((unused));
+
+static void
+prepare_vhost_memory_user(VhostUserMsg *msg, int fds[])
+{
+ int i, num;
+ struct back_file *huges;
+ struct vhost_memory_region *mr;
+
+ num = rte_eal_get_backfile_info(&huges);
+
+ if (num > VHOST_MEMORY_MAX_NREGIONS)
+ rte_panic("%d hugepage files exceed the maximum of %d for "
+ "vhost-user\n", num, VHOST_MEMORY_MAX_NREGIONS);
+
+ for (i = 0; i < num; ++i) {
+ mr = &msg->payload.memory.regions[i];
+ mr->guest_phys_addr = (uint64_t)huges[i].addr; /* use vaddr! */
+ mr->userspace_addr = (uint64_t)huges[i].addr;
+ mr->memory_size = huges[i].size;
+ mr->mmap_offset = 0;
+ fds[i] = open(huges[i].filepath, O_RDWR);
+ }
+
+ msg->payload.memory.nregions = num;
+ msg->payload.memory.padding = 0;
+ free(huges);
+}
+
+static int
+vhost_user_sock(struct virtio_hw *hw, unsigned long int req, void *arg)
+{
+ VhostUserMsg msg;
+ struct vhost_vring_file *file = 0;
+ int need_reply = 0;
+ int fds[VHOST_MEMORY_MAX_NREGIONS];
+ int fd_num = 0;
+ int i, len;
+
+ msg.request = req;
+ msg.flags = VHOST_USER_VERSION;
+ msg.size = 0;
+
+ switch (req) {
+ case VHOST_USER_GET_FEATURES:
+ need_reply = 1;
+ break;
+
+ case VHOST_USER_SET_FEATURES:
+ case VHOST_USER_SET_LOG_BASE:
+ msg.payload.u64 = *((__u64 *)arg);
+ msg.size = sizeof(m.payload.u64);
+ break;
+
+ case VHOST_USER_SET_OWNER:
+ case VHOST_USER_RESET_OWNER:
+ break;
+
+ case VHOST_USER_SET_MEM_TABLE:
+ prepare_vhost_memory_user(&msg, fds);
+ fd_num = msg.payload.memory.nregions;
+ msg.size = sizeof(m.payload.memory.nregions);
+ msg.size += sizeof(m.payload.memory.padding);
+ msg.size += fd_num * sizeof(struct vhost_memory_region);
+ break;
+
+ case VHOST_USER_SET_LOG_FD:
+ fds[fd_num++] = *((int *)arg);
+ break;
+
+ case VHOST_USER_SET_VRING_NUM:
+ case VHOST_USER_SET_VRING_BASE:
+ memcpy(&msg.payload.state, arg, sizeof(struct vhost_vring_state));
+ msg.size = sizeof(m.payload.state);
+ break;
+
+ case VHOST_USER_GET_VRING_BASE:
+ memcpy(&msg.payload.state, arg, sizeof(struct vhost_vring_state));
+ msg.size = sizeof(m.payload.state);
+ need_reply = 1;
+ break;
+
+ case VHOST_USER_SET_VRING_ADDR:
+ memcpy(&msg.payload.addr, arg, sizeof(struct vhost_vring_addr));
+ msg.size = sizeof(m.payload.addr);
+ break;
+
+ case VHOST_USER_SET_VRING_KICK:
+ case VHOST_USER_SET_VRING_CALL:
+ case VHOST_USER_SET_VRING_ERR:
+ file = arg;
+ msg.payload.u64 = file->index & VHOST_USER_VRING_IDX_MASK;
+ msg.size = sizeof(m.payload.u64);
+ if (file->fd > 0)
+ fds[fd_num++] = file->fd;
+ else
+ msg.payload.u64 |= VHOST_USER_VRING_NOFD_MASK;
+ break;
+
+ default:
+ PMD_DRV_LOG(ERR, "vhost-user trying to send unhandled msg type");
+ return -1;
+ }
+
+ len = VHOST_USER_HDR_SIZE + msg.size;
+ if (vhost_user_write(hw->vhostfd, &msg, len, fds, fd_num) < 0)
+ return 0;
+
+ if (req == VHOST_USER_SET_MEM_TABLE)
+ for (i = 0; i < fd_num; ++i)
+ close(fds[i]);
+
+ if (need_reply) {
+ if (vhost_user_read(hw->vhostfd, &msg) < 0)
+ return -1;
+
+ if (req != msg.request) {
+ PMD_DRV_LOG(ERR, "Received unexpected msg type.");
+ return -1;
+ }
+
+ switch (req) {
+ case VHOST_USER_GET_FEATURES:
+ if (msg.size != sizeof(m.payload.u64)) {
+ PMD_DRV_LOG(ERR, "Received bad msg size.");
+ return -1;
+ }
+ *((__u64 *)arg) = msg.payload.u64;
+ break;
+ case VHOST_USER_GET_VRING_BASE:
+ if (msg.size != sizeof(m.payload.state)) {
+ PMD_DRV_LOG(ERR, "Received bad msg size.");
+ return -1;
+ }
+ memcpy(arg, &msg.payload.state, sizeof(struct vhost_vring_state));
+ break;
+ default:
+ PMD_DRV_LOG(ERR, "Received unexpected msg type.");
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
+static int
+vhost_kernel_ioctl(struct virtio_hw *hw, unsigned long int req, void *arg)
+{
+ return ioctl(hw->vhostfd, req, arg);
+}
+
+enum {
+ VHOST_MSG_SET_OWNER,
+ VHOST_MSG_SET_FEATURES,
+ VHOST_MSG_GET_FEATURES,
+ VHOST_MSG_SET_VRING_CALL,
+ VHOST_MSG_SET_VRING_NUM,
+ VHOST_MSG_SET_VRING_BASE,
+ VHOST_MSG_SET_VRING_ADDR,
+ VHOST_MSG_SET_VRING_KICK,
+ VHOST_MSG_SET_MEM_TABLE,
+ VHOST_MSG_MAX,
+};
+
+static const char *vhost_msg_strings[] = {
+ "VHOST_MSG_SET_OWNER",
+ "VHOST_MSG_SET_FEATURES",
+ "VHOST_MSG_GET_FEATURES",
+ "VHOST_MSG_SET_VRING_CALL",
+ "VHOST_MSG_SET_VRING_NUM",
+ "VHOST_MSG_SET_VRING_BASE",
+ "VHOST_MSG_SET_VRING_ADDR",
+ "VHOST_MSG_SET_VRING_KICK",
+ "VHOST_MSG_SET_MEM_TABLE",
+ NULL,
+};
+
+static unsigned long int vhost_req_map[][2] = {
+ {VHOST_SET_OWNER, VHOST_USER_SET_OWNER},
+ {VHOST_SET_FEATURES, VHOST_USER_SET_FEATURES},
+ {VHOST_GET_FEATURES, VHOST_USER_GET_FEATURES},
+ {VHOST_SET_VRING_CALL, VHOST_USER_SET_VRING_CALL},
+ {VHOST_SET_VRING_NUM, VHOST_USER_SET_VRING_NUM},
+ {VHOST_SET_VRING_BASE, VHOST_USER_SET_VRING_BASE},
+ {VHOST_SET_VRING_ADDR, VHOST_USER_SET_VRING_ADDR},
+ {VHOST_SET_VRING_KICK, VHOST_USER_SET_VRING_KICK},
+ {VHOST_SET_MEM_TABLE, VHOST_USER_SET_MEM_TABLE},
+};
+
+static int
+vhost_call(struct virtio_hw *hw, unsigned long int req_orig, void *arg)
+{
+ int req_new;
+ int ret = 0;
+
+ if (req_orig >= VHOST_MSG_MAX)
+ rte_panic("invalid req: %lu\n", req_orig);
+
+ PMD_DRV_LOG(INFO, "%s\n", vhost_msg_strings[req_orig]);
+ req_new = vhost_req_map[req_orig][hw->type];
+ if (hw->type == VHOST_USER)
+ ret = vhost_user_sock(hw, req_new, arg);
+ else
+ ret = vhost_kernel_ioctl(hw, req_new, arg);
+
+ if (ret < 0)
+ rte_panic("vhost_call %s failed: %s\n",
+ vhost_msg_strings[req_orig], strerror(errno));
+
+ return ret;
+}
+
+static void
+kick_one_vq(struct virtio_hw *hw, struct virtqueue *vq, unsigned queue_sel)
+{
+ int callfd, kickfd;
+ struct vhost_vring_file file;
+
+ /* or use invalid flag to disable it, but vhost-dpdk uses this to judge
+ * if dev is alive. so finally we need two real event_fds.
+ */
+ /* Of all per virtqueue MSGs, make sure VHOST_SET_VRING_CALL comes
+ * firstly because vhost depends on this msg to allocate virtqueue
+ * pair.
+ */
+ callfd = eventfd(0, O_CLOEXEC | O_NONBLOCK);
+ if (callfd < 0)
+ rte_panic("callfd error, %s\n", strerror(errno));
+
+ file.index = queue_sel;
+ file.fd = callfd;
+ vhost_call(hw, VHOST_MSG_SET_VRING_CALL, &file);
+ hw->callfds[queue_sel] = callfd;
+
+ struct vhost_vring_state state;
+ state.index = queue_sel;
+ state.num = vq->vq_ring.num;
+ vhost_call(hw, VHOST_MSG_SET_VRING_NUM, &state);
+
+ state.num = 0; /* no reservation */
+ vhost_call(hw, VHOST_MSG_SET_VRING_BASE, &state);
+
+ struct vhost_vring_addr addr = {
+ .index = queue_sel,
+ .desc_user_addr = (uint64_t)vq->vq_ring.desc,
+ .avail_user_addr = (uint64_t)vq->vq_ring.avail,
+ .used_user_addr = (uint64_t)vq->vq_ring.used,
+ .log_guest_addr = 0,
+ .flags = 0, /* disable log */
+ };
+ vhost_call(hw, VHOST_MSG_SET_VRING_ADDR, &addr);
+
+ /* Of all per virtqueue MSGs, make sure VHOST_SET_VRING_KICK comes
+ * lastly because vhost depends on this msg to judge if
+ * virtio_is_ready().
+ */
+
+ kickfd = eventfd(0, O_CLOEXEC | O_NONBLOCK);
+ if (kickfd < 0)
+ rte_panic("kickfd error, %s\n", strerror(errno));
+
+ file.fd = kickfd;
+ vhost_call(hw, VHOST_MSG_SET_VRING_KICK, &file);
+ hw->kickfds[queue_sel] = kickfd;
+}
+
+/**
+ * Merge those virtually adjacent memsegs into one region.
+ */
+static void
+prepare_vhost_memory_kernel(struct vhost_memory_kernel **p_vm)
+{
+ unsigned i, j, k = 0;
+ struct rte_memseg *seg;
+ struct vhost_memory_region *mr;
+ struct vhost_memory_kernel *vm;
+
+ vm = malloc(sizeof(struct vhost_memory_kernel)
+ + RTE_MAX_MEMSEG * sizeof(struct vhost_memory_region));
+
+ for (i = 0; i < RTE_MAX_MEMSEG; ++i) {
+ seg = &rte_eal_get_configuration()->mem_config->memseg[i];
+ if (seg->addr == NULL)
+ break;
+
+ int new_region = 1;
+ for (j = 0; j < k; ++j) {
+ mr = &vm->regions[j];
+
+ if (mr->userspace_addr + mr->memory_size
+ == (uint64_t)seg->addr) {
+ mr->memory_size += seg->len;
+ new_region = 0;
+ break;
+ }
+
+ if ((uint64_t)seg->addr + seg->len
+ == mr->userspace_addr) {
+ mr->guest_phys_addr = (uint64_t)seg->addr;
+ mr->userspace_addr = (uint64_t)seg->addr;
+ mr->memory_size += seg->len;
+ new_region = 0;
+ break;
+ }
+ }
+
+ if (new_region == 0)
+ continue;
+
+ mr = &vm->regions[k++];
+ mr->guest_phys_addr = (uint64_t)seg->addr; /* use vaddr here! */
+ mr->userspace_addr = (uint64_t)seg->addr;
+ mr->memory_size = seg->len;
+ mr->mmap_offset = 0;
+ }
+
+ vm->nregions = k;
+ vm->padding = 0;
+ *p_vm = vm;
+}
+
+static void kick_all_vq(struct virtio_hw *hw)
+{
+ int ret;
+ unsigned i, queue_sel, nvqs;
+ struct rte_eth_dev_data *data = hw->data;
+
+ if (hw->type == VHOST_KERNEL) {
+ struct vhost_memory_kernel *vm = NULL;
+ prepare_vhost_memory_kernel(&vm);
+ vhost_call(hw, VHOST_MSG_SET_MEM_TABLE, vm);
+ free(vm);
+ } else {
+ /* construct vhost_memory inside prepare_vhost_memory_user() */
+ vhost_call(hw, VHOST_MSG_SET_MEM_TABLE, NULL);
+ }
+
+ for (i = 0; i < data->nb_rx_queues; ++i) {
+ queue_sel = 2 * i + VTNET_SQ_RQ_QUEUE_IDX;
+ kick_one_vq(hw, data->rx_queues[i], queue_sel);
+ }
+ for (i = 0; i < data->nb_tx_queues; ++i) {
+ queue_sel = 2 * i + VTNET_SQ_TQ_QUEUE_IDX;
+ kick_one_vq(hw, data->tx_queues[i], queue_sel);
+ }
+
+ /* after setup all virtqueues, we need to set_features again
+ * so that these features can be set into each virtqueue in
+ * vhost side.
+ */
+ uint64_t features = hw->guest_features;
+ features &= ~(1ull << VIRTIO_NET_F_MAC);
+ vhost_call(hw, VHOST_MSG_SET_FEATURES, &features);
+ if (ioctl(hw->backfd, TUNSETVNETHDRSZ, &hw->vtnet_hdr_size) == -1)
+ rte_panic("TUNSETVNETHDRSZ failed: %s\n", strerror(errno));
+ PMD_DRV_LOG(INFO, "set features:%"PRIx64"\n", features);
+
+ if (hw->type == VHOST_KERNEL) {
+ struct vhost_vring_file file;
+
+ file.fd = hw->backfd;
+ nvqs = data->nb_rx_queues + data->nb_tx_queues;
+ for (file.index = 0; file.index < nvqs; ++file.index) {
+ ret = vhost_kernel_ioctl(hw, VHOST_NET_SET_BACKEND, &file);
+ if (ret < 0)
+ rte_panic("VHOST_NET_SET_BACKEND failed, %s\n",
+ strerror(errno));
+ }
+ }
+
+ /* TODO: VHOST_SET_LOG_BASE */
+}
+
+void
+virtio_ioport_write(struct virtio_hw *hw, uint64_t addr, uint32_t val)
+{
+ uint64_t guest_features;
+
+ switch (addr) {
+ case VIRTIO_PCI_GUEST_FEATURES:
+ guest_features = val;
+ guest_features &= ~(1ull << VIRTIO_NET_F_MAC);
+ vhost_call(hw, VHOST_MSG_SET_FEATURES, &guest_features);
+ break;
+ case VIRTIO_PCI_QUEUE_PFN:
+ /* do nothing */
+ break;
+ case VIRTIO_PCI_QUEUE_SEL:
+ hw->queue_sel = val;
+ break;
+ case VIRTIO_PCI_STATUS:
+ if (val & VIRTIO_CONFIG_S_DRIVER_OK)
+ kick_all_vq(hw);
+ hw->status = val & 0xFF;
+ break;
+ case VIRTIO_PCI_QUEUE_NOTIFY:
+ {
+ int ret;
+ uint64_t buf = 1;
+ ret = write(hw->kickfds[val], &buf, sizeof(uint64_t));
+ if (ret == -1)
+ rte_panic("VIRTIO_PCI_QUEUE_NOTIFY failed: %s\n",
+ strerror(errno));
+ break;
+ }
+ default:
+ PMD_DRV_LOG(ERR, "unexpected address %"PRIu64" value 0x%x\n",
+ addr, val);
+ break;
+ }
+}
+
+uint32_t
+virtio_ioport_read(struct virtio_hw *hw, uint64_t addr)
+{
+ uint32_t ret = 0xFFFFFFFF;
+ uint64_t host_features;
+
+ PMD_DRV_LOG(INFO, "addr: %"PRIu64"\n", addr);
+
+ switch (addr) {
+ case VIRTIO_PCI_HOST_FEATURES:
+ vhost_call(hw, VHOST_MSG_GET_FEATURES, &host_features);
+ PMD_DRV_LOG(INFO, "get_features: %"PRIx64"\n", host_features);
+ if (hw->mac_specified)
+ host_features |= (1ull << VIRTIO_NET_F_MAC);
+ /* disable it until we support CQ */
+ host_features &= ~(1ull << VIRTIO_NET_F_CTRL_RX);
+ ret = host_features;
+ break;
+ case VIRTIO_PCI_GUEST_FEATURES:
+ ret = hw->guest_features;
+ break;
+ case VIRTIO_PCI_QUEUE_NUM:
+ ret = hw->queue_num;
+ break;
+ case VIRTIO_PCI_QUEUE_SEL:
+ ret = hw->queue_sel;
+ break;
+ case VIRTIO_PCI_STATUS:
+ ret = hw->status;
+ break;
+ case 20: /* mac addr: 0~3 */
+ if (hw->mac_specified) {
+ uint32_t m0 = hw->mac_addr[0],
+ m1 = hw->mac_addr[1],
+ m2 = hw->mac_addr[2],
+ m3 = hw->mac_addr[3];
+ ret = (m3 << 24) | (m2 << 16) | (m1 << 8) | m0;
+ }
+ break;
+ case 24: /* mac addr: 4~5 */
+ if (hw->mac_specified) {
+ uint32_t m4 = hw->mac_addr[4],
+ m5 = hw->mac_addr[5];
+ ret = (m5 << 8) | m4;
+ }
+ break;
+ default:
+ PMD_DRV_LOG(ERR, "%"PRIu64" (r) not supported\n", addr);
+ break;
+ }
+
+ return ret;
+}
+
+#define TUN_DEF_SNDBUF (1ull << 20)
+
+static void
+vhost_kernel_backend_setup(struct virtio_hw *hw)
+{
+ int fd;
+ int len = sizeof(struct virtio_net_hdr);
+ int req_mq = 0;
+ int sndbuf = TUN_DEF_SNDBUF;
+ unsigned int features;
+ struct ifreq ifr;
+
+ /* TODO:
+ * 1. get and set offload capability, tap_probe_has_ufo, tap_fd_set_offload
+ * 2. verify we can get and set vnet_hdr_len, tap_probe_vnet_hdr_len
+ * 3. get the number of memory regions from the vhost module parameter
+ * max_mem_regions, supported in newer Linux kernels
+ */
+
+ fd = open(PATH_NET_TUN, O_RDWR);
+ if (fd < 0)
+ rte_panic("open %s error, %s\n", PATH_NET_TUN, strerror(errno));
+
+ memset(&ifr, 0, sizeof(ifr));
+ ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
+
+ if (ioctl(fd, TUNGETFEATURES, &features) == -1)
+ rte_panic("TUNGETFEATURES failed: %s", strerror(errno));
+
+ if (features & IFF_ONE_QUEUE)
+ ifr.ifr_flags |= IFF_ONE_QUEUE;
+
+ if (features & IFF_VNET_HDR)
+ ifr.ifr_flags |= IFF_VNET_HDR;
+ else
+ rte_panic("vnet_hdr requested, but kernel does not support\n");
+
+ if (req_mq) {
+ if (features & IFF_MULTI_QUEUE)
+ ifr.ifr_flags |= IFF_MULTI_QUEUE;
+ else
+ rte_panic("multiqueue requested, but kernel does not support\n");
+ }
+
+ strncpy(ifr.ifr_name, "tap%d", IFNAMSIZ);
+ if (ioctl(fd, TUNSETIFF, (void *) &ifr) == -1)
+ rte_panic("TUNSETIFF failed: %s", strerror(errno));
+ fcntl(fd, F_SETFL, O_NONBLOCK);
+
+ if (ioctl(fd, TUNSETVNETHDRSZ, &len) == -1)
+ rte_panic("TUNSETVNETHDRSZ failed: %s\n", strerror(errno));
+
+ if (ioctl(fd, TUNSETSNDBUF, &sndbuf) == -1)
+ rte_panic("TUNSETSNDBUF failed: %s", strerror(errno));
+
+ hw->backfd = fd;
+
+ hw->vhostfd = open(hw->path, O_RDWR);
+ if (hw->vhostfd == -1)
+ rte_panic("open %s failed: %s\n", hw->path, strerror(errno));
+}
+
+static void
+vhost_user_backend_setup(struct virtio_hw *hw)
+{
+ int fd;
+ int flag;
+ struct sockaddr_un un;
+
+ fd = socket(AF_UNIX, SOCK_STREAM, 0);
+ if (fd < 0)
+ rte_panic("socket error, %s\n", strerror(errno));
+
+ flag = fcntl(fd, F_GETFD);
+ fcntl(fd, F_SETFD, flag | FD_CLOEXEC);
+
+ memset(&un, 0, sizeof(un));
+ un.sun_family = AF_UNIX;
+ snprintf(un.sun_path, sizeof(un.sun_path), "%s", hw->path);
+ if (connect(fd, (struct sockaddr *)&un, sizeof(un)) < 0) {
+ PMD_DRV_LOG(ERR, "connect error, %s\n", strerror(errno));
+ exit(-1);
+ }
+
+ hw->vhostfd = fd;
+}
+
+void
+virtio_vdev_init(struct rte_eth_dev_data *data, const char *path,
+ int nb_rx, int nb_tx, int nb_cq __attribute__ ((unused)),
+ int queue_num, char *mac)
+{
+ int i;
+ int ret;
+ struct stat s;
+ uint32_t tmp[ETHER_ADDR_LEN];
+ struct virtio_hw *hw = data->dev_private;
+
+ hw->data = data;
+ hw->path = strdup(path);
+ hw->max_rx_queues = nb_rx;
+ hw->max_tx_queues = nb_tx;
+ hw->queue_num = queue_num;
+ hw->mac_specified = 0;
+ if (mac) {
+ ret = sscanf(mac, "%x:%x:%x:%x:%x:%x", &tmp[0], &tmp[1],
+ &tmp[2], &tmp[3], &tmp[4], &tmp[5]);
+ if (ret == ETHER_ADDR_LEN) {
+ for (i = 0; i < ETHER_ADDR_LEN; ++i)
+ hw->mac_addr[i] = (uint8_t)tmp[i];
+ hw->mac_specified = 1;
+ }
+ }
+
+ /* TODO: cq */
+
+ ret = stat(hw->path, &s);
+ if (ret < 0)
+ rte_panic("stat: %s failed, %s\n", hw->path, strerror(errno));
+
+ switch (s.st_mode & S_IFMT) {
+ case S_IFCHR:
+ hw->type = VHOST_KERNEL;
+ vhost_kernel_backend_setup(hw);
+ break;
+ case S_IFSOCK:
+ hw->type = VHOST_USER;
+ vhost_user_backend_setup(hw);
+ break;
+ default:
+ rte_panic("unknown file type of %s\n", hw->path);
+ }
+ if (vhost_call(hw, VHOST_MSG_SET_OWNER, NULL) == -1)
+ rte_panic("vhost set_owner failed: %s\n", strerror(errno));
+}
diff --git a/drivers/net/virtio/vhost.h b/drivers/net/virtio/vhost.h
new file mode 100644
index 0000000..c7517f6
--- /dev/null
+++ b/drivers/net/virtio/vhost.h
@@ -0,0 +1,192 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _VHOST_NET_USER_H
+#define _VHOST_NET_USER_H
+
+#include <stdint.h>
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define VHOST_MEMORY_MAX_NREGIONS 8
+
+struct vhost_vring_state {
+ unsigned int index;
+ unsigned int num;
+};
+
+struct vhost_vring_file {
+ unsigned int index;
+ int fd;
+};
+
+struct vhost_vring_addr {
+ unsigned int index;
+ /* Option flags. */
+ unsigned int flags;
+ /* Flag values: */
+ /* Whether log address is valid. If set enables logging. */
+#define VHOST_VRING_F_LOG 0
+
+ /* Start of array of descriptors (virtually contiguous) */
+ uint64_t desc_user_addr;
+ /* Used structure address. Must be 32 bit aligned */
+ uint64_t used_user_addr;
+ /* Available structure address. Must be 16 bit aligned */
+ uint64_t avail_user_addr;
+ /* Logging support. */
+ /* Log writes to used structure, at offset calculated from specified
+ * address. Address must be 32 bit aligned. */
+ uint64_t log_guest_addr;
+};
+
+#define VIRTIO_CONFIG_S_DRIVER_OK 4
+
+typedef enum VhostUserRequest {
+ VHOST_USER_NONE = 0,
+ VHOST_USER_GET_FEATURES = 1,
+ VHOST_USER_SET_FEATURES = 2,
+ VHOST_USER_SET_OWNER = 3,
+ VHOST_USER_RESET_OWNER = 4,
+ VHOST_USER_SET_MEM_TABLE = 5,
+ VHOST_USER_SET_LOG_BASE = 6,
+ VHOST_USER_SET_LOG_FD = 7,
+ VHOST_USER_SET_VRING_NUM = 8,
+ VHOST_USER_SET_VRING_ADDR = 9,
+ VHOST_USER_SET_VRING_BASE = 10,
+ VHOST_USER_GET_VRING_BASE = 11,
+ VHOST_USER_SET_VRING_KICK = 12,
+ VHOST_USER_SET_VRING_CALL = 13,
+ VHOST_USER_SET_VRING_ERR = 14,
+ VHOST_USER_GET_PROTOCOL_FEATURES = 15,
+ VHOST_USER_SET_PROTOCOL_FEATURES = 16,
+ VHOST_USER_GET_QUEUE_NUM = 17,
+ VHOST_USER_SET_VRING_ENABLE = 18,
+ VHOST_USER_MAX
+} VhostUserRequest;
+
+struct vhost_memory_region {
+ uint64_t guest_phys_addr;
+ uint64_t memory_size; /* bytes */
+ uint64_t userspace_addr;
+ uint64_t mmap_offset;
+};
+struct vhost_memory_kernel {
+ uint32_t nregions;
+ uint32_t padding;
+ struct vhost_memory_region regions[0];
+};
+
+struct vhost_memory {
+ uint32_t nregions;
+ uint32_t padding;
+ struct vhost_memory_region regions[VHOST_MEMORY_MAX_NREGIONS];
+};
+
+typedef struct VhostUserMsg {
+ VhostUserRequest request;
+
+#define VHOST_USER_VERSION_MASK 0x3
+#define VHOST_USER_REPLY_MASK (0x1 << 2)
+ uint32_t flags;
+ uint32_t size; /* the following payload size */
+ union {
+#define VHOST_USER_VRING_IDX_MASK 0xff
+#define VHOST_USER_VRING_NOFD_MASK (0x1 << 8)
+ uint64_t u64;
+ struct vhost_vring_state state;
+ struct vhost_vring_addr addr;
+ struct vhost_memory memory;
+ } payload;
+ int fds[VHOST_MEMORY_MAX_NREGIONS];
+} __attribute((packed)) VhostUserMsg;
+
+#define VHOST_USER_HDR_SIZE offsetof(VhostUserMsg, payload.u64)
+#define VHOST_USER_PAYLOAD_SIZE (sizeof(VhostUserMsg) - VHOST_USER_HDR_SIZE)
+
+/* The version of the protocol we support */
+#define VHOST_USER_VERSION 0x1
+
+/* ioctls */
+
+#define VHOST_VIRTIO 0xAF
+
+#define VHOST_GET_FEATURES _IOR(VHOST_VIRTIO, 0x00, __u64)
+#define VHOST_SET_FEATURES _IOW(VHOST_VIRTIO, 0x00, __u64)
+#define VHOST_SET_OWNER _IO(VHOST_VIRTIO, 0x01)
+#define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02)
+#define VHOST_SET_MEM_TABLE _IOW(VHOST_VIRTIO, 0x03, struct vhost_memory_kernel)
+#define VHOST_SET_LOG_BASE _IOW(VHOST_VIRTIO, 0x04, __u64)
+#define VHOST_SET_LOG_FD _IOW(VHOST_VIRTIO, 0x07, int)
+#define VHOST_SET_VRING_NUM _IOW(VHOST_VIRTIO, 0x10, struct vhost_vring_state)
+#define VHOST_SET_VRING_ADDR _IOW(VHOST_VIRTIO, 0x11, struct vhost_vring_addr)
+#define VHOST_SET_VRING_BASE _IOW(VHOST_VIRTIO, 0x12, struct vhost_vring_state)
+#define VHOST_GET_VRING_BASE _IOWR(VHOST_VIRTIO, 0x12, struct vhost_vring_state)
+#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_vring_file)
+#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_vring_file)
+#define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
+#define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file)
+
+/*****************************************************************************/
+
+/* Ioctl defines */
+#define TUNSETIFF _IOW('T', 202, int)
+#define TUNGETFEATURES _IOR('T', 207, unsigned int)
+#define TUNSETOFFLOAD _IOW('T', 208, unsigned int)
+#define TUNGETIFF _IOR('T', 210, unsigned int)
+#define TUNSETSNDBUF _IOW('T', 212, int)
+#define TUNGETVNETHDRSZ _IOR('T', 215, int)
+#define TUNSETVNETHDRSZ _IOW('T', 216, int)
+#define TUNSETQUEUE _IOW('T', 217, int)
+#define TUNSETVNETLE _IOW('T', 220, int)
+#define TUNSETVNETBE _IOW('T', 222, int)
+
+/* TUNSETIFF ifr flags */
+#define IFF_TAP 0x0002
+#define IFF_NO_PI 0x1000
+#define IFF_ONE_QUEUE 0x2000
+#define IFF_VNET_HDR 0x4000
+#define IFF_MULTI_QUEUE 0x0100
+#define IFF_ATTACH_QUEUE 0x0200
+#define IFF_DETACH_QUEUE 0x0400
+
+/* Features for GSO (TUNSETOFFLOAD). */
+#define TUN_F_CSUM 0x01 /* You can hand me unchecksummed packets. */
+#define TUN_F_TSO4 0x02 /* I can handle TSO for IPv4 packets */
+#define TUN_F_TSO6 0x04 /* I can handle TSO for IPv6 packets */
+#define TUN_F_TSO_ECN 0x08 /* I can handle TSO with ECN bits. */
+#define TUN_F_UFO 0x10 /* I can handle UFO packets */
+
+#define PATH_NET_TUN "/dev/net/tun"
+
+#endif
diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index ae2d47d..9e1ecb3 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -122,5 +122,8 @@ uint16_t virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
#define VTNET_LRO_FEATURES (VIRTIO_NET_F_GUEST_TSO4 | \
VIRTIO_NET_F_GUEST_TSO6 | VIRTIO_NET_F_GUEST_ECN)

-
+#ifdef RTE_VIRTIO_VDEV
+void virtio_vdev_init(struct rte_eth_dev_data *data, const char *path,
+ int nb_rx, int nb_tx, int nb_cq, int queue_num, char *mac);
+#endif
#endif /* _VIRTIO_ETHDEV_H_ */
diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
index 47f722a..af05ae2 100644
--- a/drivers/net/virtio/virtio_pci.h
+++ b/drivers/net/virtio/virtio_pci.h
@@ -147,7 +147,6 @@ struct virtqueue;
* rest are per-device feature bits.
*/
#define VIRTIO_TRANSPORT_F_START 28
-#define VIRTIO_TRANSPORT_F_END 32

/* The Guest publishes the used index for which it expects an interrupt
* at the end of the avail ring. Host should ignore the avail->flags field. */
@@ -165,6 +164,7 @@ struct virtqueue;

struct virtio_hw {
struct virtqueue *cvq;
+#define VIRTIO_VDEV_IO_BASE 0xffffffff
uint32_t io_base;
uint32_t guest_features;
uint32_t max_tx_queues;
@@ -174,6 +174,21 @@ struct virtio_hw {
uint8_t use_msix;
uint8_t started;
uint8_t mac_addr[ETHER_ADDR_LEN];
+#ifdef RTE_VIRTIO_VDEV
+#define VHOST_KERNEL 0
+#define VHOST_USER 1
+ int type; /* type of backend */
+ uint32_t queue_num;
+ char *path;
+ int mac_specified;
+ int vhostfd;
+ int backfd; /* tap device used in vhost-net */
+ int callfds[VIRTIO_MAX_VIRTQUEUES * 2 + 1];
+ int kickfds[VIRTIO_MAX_VIRTQUEUES * 2 + 1];
+ uint32_t queue_sel;
+ uint8_t status;
+ struct rte_eth_dev_data *data;
+#endif
};

/*
@@ -229,6 +244,39 @@ outl_p(unsigned int data, unsigned int port)
#define VIRTIO_PCI_REG_ADDR(hw, reg) \
(unsigned short)((hw)->io_base + (reg))

+#ifdef RTE_VIRTIO_VDEV
+uint32_t virtio_ioport_read(struct virtio_hw *, uint64_t);
+void virtio_ioport_write(struct virtio_hw *, uint64_t, uint32_t);
+
+#define VIRTIO_READ_REG_1(hw, reg) \
+ (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ inb((VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
+ :virtio_ioport_read(hw, reg)
+#define VIRTIO_WRITE_REG_1(hw, reg, value) \
+ (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ outb_p((unsigned char)(value), (VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
+ :virtio_ioport_write(hw, reg, value)
+
+#define VIRTIO_READ_REG_2(hw, reg) \
+ (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ inw((VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
+ :virtio_ioport_read(hw, reg)
+#define VIRTIO_WRITE_REG_2(hw, reg, value) \
+ (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ outw_p((unsigned short)(value), (VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
+ :virtio_ioport_write(hw, reg, value)
+
+#define VIRTIO_READ_REG_4(hw, reg) \
+ (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ inl((VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
+ :virtio_ioport_read(hw, reg)
+#define VIRTIO_WRITE_REG_4(hw, reg, value) \
+ (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ outl_p((unsigned int)(value), (VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
+ :virtio_ioport_write(hw, reg, value)
+
+#else /* RTE_VIRTIO_VDEV */
+
#define VIRTIO_READ_REG_1(hw, reg) \
inb((VIRTIO_PCI_REG_ADDR((hw), (reg))))
#define VIRTIO_WRITE_REG_1(hw, reg, value) \
@@ -244,6 +292,8 @@ outl_p(unsigned int data, unsigned int port)
#define VIRTIO_WRITE_REG_4(hw, reg, value) \
outl_p((unsigned int)(value), (VIRTIO_PCI_REG_ADDR((hw), (reg))))

+#endif /* RTE_VIRTIO_VDEV */
+
static inline int
vtpci_with_feature(struct virtio_hw *hw, uint32_t bit)
{
--
2.1.4
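An editorial note on the register macros at the end of the patch: they dispatch on hw->io_base, with 0xffffffff (VIRTIO_VDEV_IO_BASE) acting as the "no real ioport" sentinel. Spelled out as a plain function (a sketch that assumes the patch's virtio_pci.h), the one-byte read is roughly:

#include "virtio_pci.h" /* struct virtio_hw, VIRTIO_VDEV_IO_BASE, inb() */

/* Equivalent of VIRTIO_READ_REG_1: a real PCI device keeps a valid
 * io_base and goes through ioport access; the eth_cvio vdev carries the
 * sentinel and is routed to the vhost-backed emulation instead. */
static uint8_t virtio_read_reg_1(struct virtio_hw *hw, uint64_t reg)
{
	if (hw->io_base != VIRTIO_VDEV_IO_BASE)
		return inb(VIRTIO_PCI_REG_ADDR(hw, reg));
	return (uint8_t)virtio_ioport_read(hw, reg);
}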
Pavel Fedin
2016-01-11 10:42:21 UTC
Permalink
Hello!

Please see inline.
-----Original Message-----
Sent: Sunday, January 10, 2016 2:43 PM
Subject: [PATCH 3/4] virtio/vdev: add ways to interact with vhost
Depends on the type of vhost file: vhost-user is used if the given
path points to a unix socket; vhost-net is used if the given path
points to a char device.
NOTE: we now keep CONFIG_RTE_VIRTIO_VDEV undefined by default; it needs
to be uncommented when in use.
---
config/common_linuxapp | 5 +
drivers/net/virtio/Makefile | 4 +
drivers/net/virtio/vhost.c | 734 +++++++++++++++++++++++++++++++++++++
drivers/net/virtio/vhost.h | 192 ++++++++++
drivers/net/virtio/virtio_ethdev.h | 5 +-
drivers/net/virtio/virtio_pci.h | 52 ++-
6 files changed, 990 insertions(+), 2 deletions(-)
create mode 100644 drivers/net/virtio/vhost.c
create mode 100644 drivers/net/virtio/vhost.h
diff --git a/config/common_linuxapp b/config/common_linuxapp
index 74bc515..f76e162 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -534,3 +534,8 @@ CONFIG_RTE_APP_TEST=y
CONFIG_RTE_TEST_PMD=y
CONFIG_RTE_TEST_PMD_RECORD_CORE_CYCLES=n
CONFIG_RTE_TEST_PMD_RECORD_BURST_STATS=n
+
+#
+# Enable virtio support for container
+#
+CONFIG_RTE_VIRTIO_VDEV=y
diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index 43835ba..0877023 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -52,6 +52,10 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c
+ifeq ($(CONFIG_RTE_VIRTIO_VDEV),y)
+ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += vhost.c
+endif
+
# this lib depends upon:
DEPDIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += lib/librte_eal lib/librte_ether
DEPDIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += lib/librte_mempool lib/librte_mbuf
diff --git a/drivers/net/virtio/vhost.c b/drivers/net/virtio/vhost.c
new file mode 100644
index 0000000..e423e02
--- /dev/null
+++ b/drivers/net/virtio/vhost.c
@@ -0,0 +1,734 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdint.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <stdio.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <string.h>
+#include <errno.h>
+#include <assert.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <sys/eventfd.h>
+#include <sys/ioctl.h>
+#include <net/if.h>
+
+#include <rte_mbuf.h>
+#include <rte_memory.h>
+#include <rte_eal_memconfig.h>
+
+#include "virtio_pci.h"
+#include "virtio_logs.h"
+#include "virtio_ethdev.h"
+#include "virtqueue.h"
+#include "vhost.h"
+
+static int
+vhost_user_write(int fd, void *buf, int len, int *fds, int fd_num)
+{
+ struct msghdr msgh;
+ struct iovec iov;
+ int r;
+
+ size_t fd_size = fd_num * sizeof(int);
+ char control[CMSG_SPACE(fd_size)];
+ struct cmsghdr *cmsg;
+
+ memset(&msgh, 0, sizeof(msgh));
+ memset(control, 0, sizeof(control));
+
+ iov.iov_base = (uint8_t *)buf;
+ iov.iov_len = len;
+
+ msgh.msg_iov = &iov;
+ msgh.msg_iovlen = 1;
+
+ msgh.msg_control = control;
+ msgh.msg_controllen = sizeof(control);
+
+ cmsg = CMSG_FIRSTHDR(&msgh);
+
+ cmsg->cmsg_len = CMSG_LEN(fd_size);
+ cmsg->cmsg_level = SOL_SOCKET;
+ cmsg->cmsg_type = SCM_RIGHTS;
+ memcpy(CMSG_DATA(cmsg), fds, fd_size);
+
+ do {
+ r = sendmsg(fd, &msgh, 0);
+ } while (r < 0 && errno == EINTR);
+
+ return r;
+}
+
+static int
+vhost_user_read(int fd, VhostUserMsg *msg)
+{
+ uint32_t valid_flags = VHOST_USER_REPLY_MASK | VHOST_USER_VERSION;
+ int ret, sz_hdr = VHOST_USER_HDR_SIZE, sz_payload;
+
+ ret = recv(fd, (void *)msg, sz_hdr, 0);
+ if (ret < sz_hdr) {
+ PMD_DRV_LOG(ERR, "Failed to recv msg hdr: %d instead of %d.",
+ ret, sz_hdr);
+ goto fail;
+ }
+
+ /* validate msg flags */
+ if (msg->flags != (valid_flags)) {
+ PMD_DRV_LOG(ERR, "Failed to recv msg: flags 0x%x instead of 0x%x.",
+ msg->flags, valid_flags);
+ goto fail;
+ }
+
+ sz_payload = msg->size;
+ if (sz_payload) {
+ ret = recv(fd, (void *)((uint8_t *)msg + sz_hdr), sz_payload, 0);
+ if (ret < sz_payload) {
+ PMD_DRV_LOG(ERR, "Failed to recv msg payload: %d instead of %d.",
+ ret, msg->size);
+ goto fail;
+ }
+ }
+
+ return 0;
+
+fail:
+ return -1;
+}
+
+static VhostUserMsg m __attribute__ ((unused));
+
+static void
+prepare_vhost_memory_user(VhostUserMsg *msg, int fds[])
+{
+ int i, num;
+ struct back_file *huges;
+ struct vhost_memory_region *mr;
+
+ num = rte_eal_get_backfile_info(&huges);
+
+ if (num > VHOST_MEMORY_MAX_NREGIONS)
+ rte_panic("%d hugepage files exceed the maximum of %d for "
+ "vhost-user\n", num, VHOST_MEMORY_MAX_NREGIONS);
+
+ for (i = 0; i < num; ++i) {
+ mr = &msg->payload.memory.regions[i];
+ mr->guest_phys_addr = (uint64_t)huges[i].addr; /* use vaddr! */
+ mr->userspace_addr = (uint64_t)huges[i].addr;
+ mr->memory_size = huges[i].size;
+ mr->mmap_offset = 0;
+ fds[i] = open(huges[i].filepath, O_RDWR);
+ }
+
+ msg->payload.memory.nregions = num;
+ msg->payload.memory.padding = 0;
+ free(huges);
+}
+
+static int
+vhost_user_sock(struct virtio_hw *hw, unsigned long int req, void *arg)
+{
+ VhostUserMsg msg;
+ struct vhost_vring_file *file = 0;
+ int need_reply = 0;
+ int fds[VHOST_MEMORY_MAX_NREGIONS];
+ int fd_num = 0;
+ int i, len;
+
+ msg.request = req;
+ msg.flags = VHOST_USER_VERSION;
+ msg.size = 0;
+
+ switch (req) {
+ case VHOST_USER_GET_FEATURES:
+ need_reply = 1;
+ break;
+
+ case VHOST_USER_SET_FEATURES:
+ case VHOST_USER_SET_LOG_BASE:
+ msg.payload.u64 = *((__u64 *)arg);
+ msg.size = sizeof(m.payload.u64);
+ break;
+
+ case VHOST_USER_SET_OWNER:
+ case VHOST_USER_RESET_OWNER:
+ break;
+
+ case VHOST_USER_SET_MEM_TABLE:
+ prepare_vhost_memory_user(&msg, fds);
+ fd_num = msg.payload.memory.nregions;
+ msg.size = sizeof(m.payload.memory.nregions);
+ msg.size += sizeof(m.payload.memory.padding);
+ msg.size += fd_num * sizeof(struct vhost_memory_region);
+ break;
+
+ case VHOST_USER_SET_LOG_FD:
+ fds[fd_num++] = *((int *)arg);
+ break;
+
+ case VHOST_USER_SET_VRING_NUM:
+ case VHOST_USER_SET_VRING_BASE:
+ memcpy(&msg.payload.state, arg, sizeof(struct vhost_vring_state));
+ msg.size = sizeof(m.payload.state);
+ break;
+
+ case VHOST_USER_GET_VRING_BASE:
+ memcpy(&msg.payload.state, arg, sizeof(struct vhost_vring_state));
+ msg.size = sizeof(m.payload.state);
+ need_reply = 1;
+ break;
+
+ case VHOST_USER_SET_VRING_ADDR:
+ memcpy(&msg.payload.addr, arg, sizeof(struct vhost_vring_addr));
+ msg.size = sizeof(m.payload.addr);
+ break;
+
+ case VHOST_USER_SET_VRING_KICK:
+ case VHOST_USER_SET_VRING_CALL:
+ case VHOST_USER_SET_VRING_ERR:
+ file = arg;
+ msg.payload.u64 = file->index & VHOST_USER_VRING_IDX_MASK;
+ msg.size = sizeof(m.payload.u64);
+ if (file->fd > 0)
+ fds[fd_num++] = file->fd;
+ else
+ msg.payload.u64 |= VHOST_USER_VRING_NOFD_MASK;
+ break;
+
+ default:
+ PMD_DRV_LOG(ERR, "vhost-user trying to send unhandled msg type");
+ return -1;
+ }
+
+ len = VHOST_USER_HDR_SIZE + msg.size;
+ if (vhost_user_write(hw->vhostfd, &msg, len, fds, fd_num) < 0)
+ return 0;
+
+ if (req == VHOST_USER_SET_MEM_TABLE)
+ for (i = 0; i < fd_num; ++i)
+ close(fds[i]);
+
+ if (need_reply) {
+ if (vhost_user_read(hw->vhostfd, &msg) < 0)
+ return -1;
+
+ if (req != msg.request) {
+ PMD_DRV_LOG(ERR, "Received unexpected msg type.");
+ return -1;
+ }
+
+ switch (req) {
+ case VHOST_USER_GET_FEATURES:
+ if (msg.size != sizeof(m.payload.u64)) {
+ PMD_DRV_LOG(ERR, "Received bad msg size.");
+ return -1;
+ }
+ *((__u64 *)arg) = msg.payload.u64;
+ break;
+ case VHOST_USER_GET_VRING_BASE:
+ if (msg.size != sizeof(m.payload.state)) {
+ PMD_DRV_LOG(ERR, "Received bad msg size.");
+ return -1;
+ }
+ memcpy(arg, &msg.payload.state, sizeof(struct vhost_vring_state));
+ break;
+ default:
+ PMD_DRV_LOG(ERR, "Received unexpected msg type.");
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
+static int
+vhost_kernel_ioctl(struct virtio_hw *hw, unsigned long int req, void *arg)
+{
+ return ioctl(hw->vhostfd, req, arg);
+}
+
+enum {
+ VHOST_MSG_SET_OWNER,
+ VHOST_MSG_SET_FEATURES,
+ VHOST_MSG_GET_FEATURES,
+ VHOST_MSG_SET_VRING_CALL,
+ VHOST_MSG_SET_VRING_NUM,
+ VHOST_MSG_SET_VRING_BASE,
+ VHOST_MSG_SET_VRING_ADDR,
+ VHOST_MSG_SET_VRING_KICK,
+ VHOST_MSG_SET_MEM_TABLE,
+ VHOST_MSG_MAX,
+};
+
+static const char *vhost_msg_strings[] = {
+ "VHOST_MSG_SET_OWNER",
+ "VHOST_MSG_SET_FEATURES",
+ "VHOST_MSG_GET_FEATURES",
+ "VHOST_MSG_SET_VRING_CALL",
+ "VHOST_MSG_SET_VRING_NUM",
+ "VHOST_MSG_SET_VRING_BASE",
+ "VHOST_MSG_SET_VRING_ADDR",
+ "VHOST_MSG_SET_VRING_KICK",
+ "VHOST_MSG_SET_MEM_TABLE",
+ NULL,
+};
+
+static unsigned long int vhost_req_map[][2] = {
+ {VHOST_SET_OWNER, VHOST_USER_SET_OWNER},
+ {VHOST_SET_FEATURES, VHOST_USER_SET_FEATURES},
+ {VHOST_GET_FEATURES, VHOST_USER_GET_FEATURES},
+ {VHOST_SET_VRING_CALL, VHOST_USER_SET_VRING_CALL},
+ {VHOST_SET_VRING_NUM, VHOST_USER_SET_VRING_NUM},
+ {VHOST_SET_VRING_BASE, VHOST_USER_SET_VRING_BASE},
+ {VHOST_SET_VRING_ADDR, VHOST_USER_SET_VRING_ADDR},
+ {VHOST_SET_VRING_KICK, VHOST_USER_SET_VRING_KICK},
+ {VHOST_SET_MEM_TABLE, VHOST_USER_SET_MEM_TABLE},
+};
+
+static int
+vhost_call(struct virtio_hw *hw, unsigned long int req_orig, void *arg)
+{
+ int req_new;
+ int ret = 0;
+
+ if (req_orig >= VHOST_MSG_MAX)
+ rte_panic("invalid req: %lu\n", req_orig);
+
+ PMD_DRV_LOG(INFO, "%s\n", vhost_msg_strings[req_orig]);
+ req_new = vhost_req_map[req_orig][hw->type];
+ if (hw->type == VHOST_USER)
+ ret = vhost_user_sock(hw, req_new, arg);
+ else
+ ret = vhost_kernel_ioctl(hw, req_new, arg);
+
+ if (ret < 0)
+ rte_panic("vhost_call %s failed: %s\n",
+ vhost_msg_strings[req_orig], strerror(errno));
+
+ return ret;
+}
+
+static void
+kick_one_vq(struct virtio_hw *hw, struct virtqueue *vq, unsigned queue_sel)
+{
+ int callfd, kickfd;
+ struct vhost_vring_file file;
+
+ /* or use invalid flag to disable it, but vhost-dpdk uses this to judge
+ * if dev is alive. so finally we need two real event_fds.
+ */
+ /* Of all per virtqueue MSGs, make sure VHOST_SET_VRING_CALL comes
+ * firstly because vhost depends on this msg to allocate virtqueue
+ * pair.
+ */
+ callfd = eventfd(0, O_CLOEXEC | O_NONBLOCK);
+ if (callfd < 0)
+ rte_panic("callfd error, %s\n", strerror(errno));
+
+ file.index = queue_sel;
+ file.fd = callfd;
+ vhost_call(hw, VHOST_MSG_SET_VRING_CALL, &file);
+ hw->callfds[queue_sel] = callfd;
+
+ struct vhost_vring_state state;
+ state.index = queue_sel;
+ state.num = vq->vq_ring.num;
+ vhost_call(hw, VHOST_MSG_SET_VRING_NUM, &state);
+
+ state.num = 0; /* no reservation */
+ vhost_call(hw, VHOST_MSG_SET_VRING_BASE, &state);
+
+ struct vhost_vring_addr addr = {
+ .index = queue_sel,
+ .desc_user_addr = (uint64_t)vq->vq_ring.desc,
+ .avail_user_addr = (uint64_t)vq->vq_ring.avail,
+ .used_user_addr = (uint64_t)vq->vq_ring.used,
+ .log_guest_addr = 0,
+ .flags = 0, /* disable log */
+ };
+ vhost_call(hw, VHOST_MSG_SET_VRING_ADDR, &addr);
+
+ /* Of all per virtqueue MSGs, make sure VHOST_SET_VRING_KICK comes
+ * lastly because vhost depends on this msg to judge if
+ * virtio_is_ready().
+ */
+
+ kickfd = eventfd(0, O_CLOEXEC | O_NONBLOCK);
+ if (kickfd < 0)
+ rte_panic("kickfd error, %s\n", strerror(errno));
+
+ file.fd = kickfd;
+ vhost_call(hw, VHOST_MSG_SET_VRING_KICK, &file);
+ hw->kickfds[queue_sel] = kickfd;
+}
+
+/**
+ * Merge those virtually adjacent memsegs into one region.
+ */
+static void
+prepare_vhost_memory_kernel(struct vhost_memory_kernel **p_vm)
+{
+ unsigned i, j, k = 0;
+ struct rte_memseg *seg;
+ struct vhost_memory_region *mr;
+ struct vhost_memory_kernel *vm;
+
+ vm = malloc(sizeof(struct vhost_memory_kernel)
+ + RTE_MAX_MEMSEG * sizeof(struct vhost_memory_region));
+
+ for (i = 0; i < RTE_MAX_MEMSEG; ++i) {
+ seg = &rte_eal_get_configuration()->mem_config->memseg[i];
+ if (seg->addr == NULL)
+ break;
+
+ int new_region = 1;
+ for (j = 0; j < k; ++j) {
+ mr = &vm->regions[j];
+
+ if (mr->userspace_addr + mr->memory_size
+ == (uint64_t)seg->addr) {
+ mr->memory_size += seg->len;
+ new_region = 0;
+ break;
+ }
+
+ if ((uint64_t)seg->addr + seg->len
+ == mr->userspace_addr) {
+ mr->guest_phys_addr = (uint64_t)seg->addr;
+ mr->userspace_addr = (uint64_t)seg->addr;
+ mr->memory_size += seg->len;
+ new_region = 0;
+ break;
+ }
+ }
+
+ if (new_region == 0)
+ continue;
+
+ mr = &vm->regions[k++];
+ mr->guest_phys_addr = (uint64_t)seg->addr; /* use vaddr here! */
+ mr->userspace_addr = (uint64_t)seg->addr;
+ mr->memory_size = seg->len;
+ mr->mmap_offset = 0;
+ }
+
+ vm->nregions = k;
+ vm->padding = 0;
+ *p_vm = vm;
+}
+
+static void kick_all_vq(struct virtio_hw *hw)
+{
+ int ret;
+ unsigned i, queue_sel, nvqs;
+ struct rte_eth_dev_data *data = hw->data;
+
+ if (hw->type == VHOST_KERNEL) {
+ struct vhost_memory_kernel *vm = NULL;
+ prepare_vhost_memory_kernel(&vm);
+ vhost_call(hw, VHOST_MSG_SET_MEM_TABLE, vm);
+ free(vm);
+ } else {
+ /* construct vhost_memory inside prepare_vhost_memory_user() */
+ vhost_call(hw, VHOST_MSG_SET_MEM_TABLE, NULL);
+ }
+
+ for (i = 0; i < data->nb_rx_queues; ++i) {
+ queue_sel = 2 * i + VTNET_SQ_RQ_QUEUE_IDX;
+ kick_one_vq(hw, data->rx_queues[i], queue_sel);
+ }
+ for (i = 0; i < data->nb_tx_queues; ++i) {
+ queue_sel = 2 * i + VTNET_SQ_TQ_QUEUE_IDX;
+ kick_one_vq(hw, data->tx_queues[i], queue_sel);
+ }
+
+ /* after setup all virtqueues, we need to set_features again
+ * so that these features can be set into each virtqueue in
+ * vhost side.
+ */
+ uint64_t features = hw->guest_features;
+ features &= ~(1ull << VIRTIO_NET_F_MAC);
+ vhost_call(hw, VHOST_MSG_SET_FEATURES, &features);
+ if (ioctl(hw->backfd, TUNSETVNETHDRSZ, &hw->vtnet_hdr_size) == -1)
+ rte_panic("TUNSETVNETHDRSZ failed: %s\n", strerror(errno));
+ PMD_DRV_LOG(INFO, "set features:%"PRIx64"\n", features);
+
+ if (hw->type == VHOST_KERNEL) {
+ struct vhost_vring_file file;
+
+ file.fd = hw->backfd;
+ nvqs = data->nb_rx_queues + data->nb_tx_queues;
+ for (file.index = 0; file.index < nvqs; ++file.index) {
+ ret = vhost_kernel_ioctl(hw, VHOST_NET_SET_BACKEND, &file);
+ if (ret < 0)
+ rte_panic("VHOST_NET_SET_BACKEND failed, %s\n",
+ strerror(errno));
+ }
+ }
+
+ /* TODO: VHOST_SET_LOG_BASE */
+}
+
+void
+virtio_ioport_write(struct virtio_hw *hw, uint64_t addr, uint32_t val)
+{
+ uint64_t guest_features;
+
+ switch (addr) {
+ case VIRTIO_PCI_GUEST_FEATURES:
+ guest_features = val;
+ guest_features &= ~(1ull << VIRTIO_NET_F_MAC);
+ vhost_call(hw, VHOST_MSG_SET_FEATURES, &guest_features);
+ break;
+ case VIRTIO_PCI_QUEUE_PFN:
+ /* do nothing */
+ break;
+ case VIRTIO_PCI_QUEUE_SEL:
+ hw->queue_sel = val;
+ break;
+ case VIRTIO_PCI_STATUS:
+ if (val & VIRTIO_CONFIG_S_DRIVER_OK)
+ kick_all_vq(hw);
+ hw->status = val & 0xFF;
+ break;
+ case VIRTIO_PCI_QUEUE_NOTIFY:
+ {
+ int ret;
+ uint64_t buf = 1;
+ ret = write(hw->kickfds[val], &buf, sizeof(uint64_t));
+ if (ret == -1)
+ rte_panic("VIRTIO_PCI_QUEUE_NOTIFY failed: %s\n",
+ strerror(errno));
+ break;
+ }
+ default:
+ PMD_DRV_LOG(ERR, "unexpected address %"PRIu64" value 0x%x\n",
+ addr, val);
+ break;
+ }
+}
+
+uint32_t
+virtio_ioport_read(struct virtio_hw *hw, uint64_t addr)
+{
+ uint32_t ret = 0xFFFFFFFF;
+ uint64_t host_features;
+
+ PMD_DRV_LOG(INFO, "addr: %"PRIu64"\n", addr);
+
+ switch (addr) {
+ case VIRTIO_PCI_HOST_FEATURES:
+ vhost_call(hw, VHOST_MSG_GET_FEATURES, &host_features);
+ PMD_DRV_LOG(INFO, "get_features: %"PRIx64"\n", host_features);
+ if (hw->mac_specified)
+ host_features |= (1ull << VIRTIO_NET_F_MAC);
+ /* disable it until we support CQ */
+ host_features &= ~(1ull << VIRTIO_NET_F_CTRL_RX);
+ ret = host_features;
+ break;
+ case VIRTIO_PCI_GUEST_FEATURES:
+ ret = hw->guest_features;
+ break;
+ case VIRTIO_PCI_QUEUE_NUM:
+ ret = hw->queue_num;
+ break;
+ case VIRTIO_PCI_QUEUE_SEL:
+ ret = hw->queue_sel;
+ break;
+ case VIRTIO_PCI_STATUS:
+ ret = hw->status;
+ break;
+ case 20: /* mac addr: 0~3 */
+ if (hw->mac_specified) {
+ uint32_t m0 = hw->mac_addr[0],
+ m1 = hw->mac_addr[1],
+ m2 = hw->mac_addr[2],
+ m3 = hw->mac_addr[3];
+ ret = (m3 << 24) | (m2 << 16) | (m1 << 8) | m0;
+ }
+ break;
+ case 24: /* mac addr: 4~5 */
+ if (hw->mac_specified) {
+ uint32_t m4 = hw->mac_addr[4],
+ m5 = hw->mac_addr[5];
+ ret = (m5 << 8) | m4;
+ }
+ break;
+ default:
+ PMD_DRV_LOG(ERR, "%"PRIu64" (r) not supported\n", addr);
+ break;
+ }
+
+ return ret;
+}
+
+#define TUN_DEF_SNDBUF (1ull << 20)
+
+static void
+vhost_kernel_backend_setup(struct virtio_hw *hw)
+{
+ int fd;
+ int len = sizeof(struct virtio_net_hdr);
+ int req_mq = 0;
+ int sndbuf = TUN_DEF_SNDBUF;
+ unsigned int features;
+ struct ifreq ifr;
+
+ /* TODO:
+ * 1. get and set offload capability, tap_probe_has_ufo, tap_fd_set_offload
+ * 2. verify we can get and set vnet_hdr_len, tap_probe_vnet_hdr_len
+ * 3. get the number of memory regions from the vhost module parameter
+ * max_mem_regions, supported in newer Linux kernels
+ */
+
+ fd = open(PATH_NET_TUN, O_RDWR);
+ if (fd < 0)
+ rte_panic("open %s error, %s\n", PATH_NET_TUN, strerror(errno));
+
+ memset(&ifr, 0, sizeof(ifr));
+ ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
+
+ if (ioctl(fd, TUNGETFEATURES, &features) == -1)
+ rte_panic("TUNGETFEATURES failed: %s", strerror(errno));
+
+ if (features & IFF_ONE_QUEUE)
+ ifr.ifr_flags |= IFF_ONE_QUEUE;
+
+ if (features & IFF_VNET_HDR)
+ ifr.ifr_flags |= IFF_VNET_HDR;
+ else
+ rte_panic("vnet_hdr requested, but kernel does not support\n");
+
+ if (req_mq) {
+ if (features & IFF_MULTI_QUEUE)
+ ifr.ifr_flags |= IFF_MULTI_QUEUE;
+ else
+ rte_panic("multiqueue requested, but kernel does not support\n");
+ }
+
+ strncpy(ifr.ifr_name, "tap%d", IFNAMSIZ);
+ if (ioctl(fd, TUNSETIFF, (void *) &ifr) == -1)
+ rte_panic("TUNSETIFF failed: %s", strerror(errno));
+ fcntl(fd, F_SETFL, O_NONBLOCK);
+
+ if (ioctl(fd, TUNSETVNETHDRSZ, &len) == -1)
+ rte_panic("TUNSETVNETHDRSZ failed: %s\n", strerror(errno));
+
+ if (ioctl(fd, TUNSETSNDBUF, &sndbuf) == -1)
+ rte_panic("TUNSETSNDBUF failed: %s", strerror(errno));
+
+ hw->backfd = fd;
+
+ hw->vhostfd = open(hw->path, O_RDWR);
+ if (hw->vhostfd == -1)
+ rte_panic("open %s failed: %s\n", hw->path, strerror(errno));
+}
+
+static void
+vhost_user_backend_setup(struct virtio_hw *hw)
+{
+ int fd;
+ int flag;
+ struct sockaddr_un un;
+
+ fd = socket(AF_UNIX, SOCK_STREAM, 0);
+ if (fd < 0)
+ rte_panic("socket error, %s\n", strerror(errno));
+
+ flag = fcntl(fd, F_GETFD);
+ fcntl(fd, F_SETFD, flag | FD_CLOEXEC);
+
+ memset(&un, 0, sizeof(un));
+ un.sun_family = AF_UNIX;
+ snprintf(un.sun_path, sizeof(un.sun_path), "%s", hw->path);
+ if (connect(fd, (struct sockaddr *)&un, sizeof(un)) < 0) {
+ PMD_DRV_LOG(ERR, "connect error, %s\n", strerror(errno));
+ exit(-1);
+ }
+
+ hw->vhostfd = fd;
+}
+
+void
+virtio_vdev_init(struct rte_eth_dev_data *data, const char *path,
+ int nb_rx, int nb_tx, int nb_cq __attribute__ ((unused)),
+ int queue_num, char *mac)
+{
+ int i;
+ int ret;
+ struct stat s;
+ uint32_t tmp[ETHER_ADDR_LEN];
+ struct virtio_hw *hw = data->dev_private;
+
+ hw->data = data;
+ hw->path = strdup(path);
+ hw->max_rx_queues = nb_rx;
+ hw->max_tx_queues = nb_tx;
+ hw->queue_num = queue_num;
+ hw->mac_specified = 0;
+ if (mac) {
+ ret = sscanf(mac, "%x:%x:%x:%x:%x:%x", &tmp[0], &tmp[1],
+ &tmp[2], &tmp[3], &tmp[4], &tmp[5]);
+ if (ret == ETHER_ADDR_LEN) {
+ for (i = 0; i < ETHER_ADDR_LEN; ++i)
+ hw->mac_addr[i] = (uint8_t)tmp[i];
+ hw->mac_specified = 1;
+ }
+ }
+
+ /* TODO: cq */
+
+ ret = stat(hw->path, &s);
+ if (ret < 0)
+ rte_panic("stat: %s failed, %s\n", hw->path, strerror(errno));
+
+ switch (s.st_mode & S_IFMT) {
+ case S_IFCHR:
+ hw->type = VHOST_KERNEL;
+ vhost_kernel_backend_setup(hw);
+ break;
+ case S_IFSOCK:
+ hw->type = VHOST_USER;
+ vhost_user_backend_setup(hw);
+ break;
+ default:
+ rte_panic("unknown file type of %s\n", hw->path);
+ }
+ if (vhost_call(hw, VHOST_MSG_SET_OWNER, NULL) == -1)
+ rte_panic("vhost set_owner failed: %s\n", strerror(errno));
+}
diff --git a/drivers/net/virtio/vhost.h b/drivers/net/virtio/vhost.h
new file mode 100644
index 0000000..c7517f6
--- /dev/null
+++ b/drivers/net/virtio/vhost.h
@@ -0,0 +1,192 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _VHOST_NET_USER_H
+#define _VHOST_NET_USER_H
+
+#include <stdint.h>
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define VHOST_MEMORY_MAX_NREGIONS 8
+
+struct vhost_vring_state {
+ unsigned int index;
+ unsigned int num;
+};
+
+struct vhost_vring_file {
+ unsigned int index;
+ int fd;
+};
+
+struct vhost_vring_addr {
+ unsigned int index;
+ /* Option flags. */
+ unsigned int flags;
+ /* Flag values: */
+ /* Whether log address is valid. If set enables logging. */
+#define VHOST_VRING_F_LOG 0
+
+ /* Start of array of descriptors (virtually contiguous) */
+ uint64_t desc_user_addr;
+ /* Used structure address. Must be 32 bit aligned */
+ uint64_t used_user_addr;
+ /* Available structure address. Must be 16 bit aligned */
+ uint64_t avail_user_addr;
+ /* Logging support. */
+ /* Log writes to used structure, at offset calculated from specified
+ * address. Address must be 32 bit aligned. */
+ uint64_t log_guest_addr;
+};
+
+#define VIRTIO_CONFIG_S_DRIVER_OK 4
+
+typedef enum VhostUserRequest {
+ VHOST_USER_NONE = 0,
+ VHOST_USER_GET_FEATURES = 1,
+ VHOST_USER_SET_FEATURES = 2,
+ VHOST_USER_SET_OWNER = 3,
+ VHOST_USER_RESET_OWNER = 4,
+ VHOST_USER_SET_MEM_TABLE = 5,
+ VHOST_USER_SET_LOG_BASE = 6,
+ VHOST_USER_SET_LOG_FD = 7,
+ VHOST_USER_SET_VRING_NUM = 8,
+ VHOST_USER_SET_VRING_ADDR = 9,
+ VHOST_USER_SET_VRING_BASE = 10,
+ VHOST_USER_GET_VRING_BASE = 11,
+ VHOST_USER_SET_VRING_KICK = 12,
+ VHOST_USER_SET_VRING_CALL = 13,
+ VHOST_USER_SET_VRING_ERR = 14,
+ VHOST_USER_GET_PROTOCOL_FEATURES = 15,
+ VHOST_USER_SET_PROTOCOL_FEATURES = 16,
+ VHOST_USER_GET_QUEUE_NUM = 17,
+ VHOST_USER_SET_VRING_ENABLE = 18,
+ VHOST_USER_MAX
+} VhostUserRequest;
+
+struct vhost_memory_region {
+ uint64_t guest_phys_addr;
+ uint64_t memory_size; /* bytes */
+ uint64_t userspace_addr;
+ uint64_t mmap_offset;
+};
+struct vhost_memory_kernel {
+ uint32_t nregions;
+ uint32_t padding;
+ struct vhost_memory_region regions[0];
+};
+
+struct vhost_memory {
+ uint32_t nregions;
+ uint32_t padding;
+ struct vhost_memory_region regions[VHOST_MEMORY_MAX_NREGIONS];
+};
+
+typedef struct VhostUserMsg {
+ VhostUserRequest request;
+
+#define VHOST_USER_VERSION_MASK 0x3
+#define VHOST_USER_REPLY_MASK (0x1 << 2)
+ uint32_t flags;
+ uint32_t size; /* the following payload size */
+ union {
+#define VHOST_USER_VRING_IDX_MASK 0xff
+#define VHOST_USER_VRING_NOFD_MASK (0x1 << 8)
+ uint64_t u64;
+ struct vhost_vring_state state;
+ struct vhost_vring_addr addr;
+ struct vhost_memory memory;
+ } payload;
+ int fds[VHOST_MEMORY_MAX_NREGIONS];
+} __attribute((packed)) VhostUserMsg;
+
+#define VHOST_USER_HDR_SIZE offsetof(VhostUserMsg, payload.u64)
+#define VHOST_USER_PAYLOAD_SIZE (sizeof(VhostUserMsg) - VHOST_USER_HDR_SIZE)
+
+/* The version of the protocol we support */
+#define VHOST_USER_VERSION 0x1
+
+/* ioctls */
+
+#define VHOST_VIRTIO 0xAF
+
+#define VHOST_GET_FEATURES _IOR(VHOST_VIRTIO, 0x00, __u64)
+#define VHOST_SET_FEATURES _IOW(VHOST_VIRTIO, 0x00, __u64)
+#define VHOST_SET_OWNER _IO(VHOST_VIRTIO, 0x01)
+#define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02)
+#define VHOST_SET_MEM_TABLE _IOW(VHOST_VIRTIO, 0x03, struct vhost_memory_kernel)
+#define VHOST_SET_LOG_BASE _IOW(VHOST_VIRTIO, 0x04, __u64)
+#define VHOST_SET_LOG_FD _IOW(VHOST_VIRTIO, 0x07, int)
+#define VHOST_SET_VRING_NUM _IOW(VHOST_VIRTIO, 0x10, struct vhost_vring_state)
+#define VHOST_SET_VRING_ADDR _IOW(VHOST_VIRTIO, 0x11, struct vhost_vring_addr)
+#define VHOST_SET_VRING_BASE _IOW(VHOST_VIRTIO, 0x12, struct vhost_vring_state)
+#define VHOST_GET_VRING_BASE _IOWR(VHOST_VIRTIO, 0x12, struct vhost_vring_state)
+#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_vring_file)
+#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_vring_file)
+#define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
+#define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file)
+
+/*****************************************************************************/
+
+/* Ioctl defines */
+#define TUNSETIFF _IOW('T', 202, int)
+#define TUNGETFEATURES _IOR('T', 207, unsigned int)
+#define TUNSETOFFLOAD _IOW('T', 208, unsigned int)
+#define TUNGETIFF _IOR('T', 210, unsigned int)
+#define TUNSETSNDBUF _IOW('T', 212, int)
+#define TUNGETVNETHDRSZ _IOR('T', 215, int)
+#define TUNSETVNETHDRSZ _IOW('T', 216, int)
+#define TUNSETQUEUE _IOW('T', 217, int)
+#define TUNSETVNETLE _IOW('T', 220, int)
+#define TUNSETVNETBE _IOW('T', 222, int)
+
+/* TUNSETIFF ifr flags */
+#define IFF_TAP 0x0002
+#define IFF_NO_PI 0x1000
+#define IFF_ONE_QUEUE 0x2000
+#define IFF_VNET_HDR 0x4000
+#define IFF_MULTI_QUEUE 0x0100
+#define IFF_ATTACH_QUEUE 0x0200
+#define IFF_DETACH_QUEUE 0x0400
+
+/* Features for GSO (TUNSETOFFLOAD). */
+#define TUN_F_CSUM 0x01 /* You can hand me unchecksummed packets. */
+#define TUN_F_TSO4 0x02 /* I can handle TSO for IPv4 packets */
+#define TUN_F_TSO6 0x04 /* I can handle TSO for IPv6 packets */
+#define TUN_F_TSO_ECN 0x08 /* I can handle TSO with ECN bits. */
+#define TUN_F_UFO 0x10 /* I can handle UFO packets */
+
+#define PATH_NET_TUN "/dev/net/tun"
+
+#endif
diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index ae2d47d..9e1ecb3 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -122,5 +122,8 @@ uint16_t virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf
**tx_pkts,
#define VTNET_LRO_FEATURES (VIRTIO_NET_F_GUEST_TSO4 | \
VIRTIO_NET_F_GUEST_TSO6 | VIRTIO_NET_F_GUEST_ECN)
-
+#ifdef RTE_VIRTIO_VDEV
+void virtio_vdev_init(struct rte_eth_dev_data *data, const char *path,
+ int nb_rx, int nb_tx, int nb_cq, int queue_num, char *mac);
+#endif
#endif /* _VIRTIO_ETHDEV_H_ */
diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
index 47f722a..af05ae2 100644
--- a/drivers/net/virtio/virtio_pci.h
+++ b/drivers/net/virtio/virtio_pci.h
@@ -147,7 +147,6 @@ struct virtqueue;
* rest are per-device feature bits.
*/
#define VIRTIO_TRANSPORT_F_START 28
-#define VIRTIO_TRANSPORT_F_END 32
I understand that this #define is not used, but... Maybe we should do this cleanup as a separate patch? Otherwise it's hard to
track this change (I believe this definition had some use in the past).
/* The Guest publishes the used index for which it expects an interrupt
* at the end of the avail ring. Host should ignore the avail->flags field. */
@@ -165,6 +164,7 @@ struct virtqueue;
struct virtio_hw {
struct virtqueue *cvq;
+#define VIRTIO_VDEV_IO_BASE 0xffffffff
uint32_t io_base;
uint32_t guest_features;
uint32_t max_tx_queues;
@@ -174,6 +174,21 @@ struct virtio_hw {
uint8_t use_msix;
uint8_t started;
uint8_t mac_addr[ETHER_ADDR_LEN];
+#ifdef RTE_VIRTIO_VDEV
+#define VHOST_KERNEL 0
+#define VHOST_USER 1
+ int type; /* type of backend */
+ uint32_t queue_num;
+ char *path;
+ int mac_specified;
+ int vhostfd;
+ int backfd; /* tap device used in vhost-net */
+ int callfds[VIRTIO_MAX_VIRTQUEUES * 2 + 1];
+ int kickfds[VIRTIO_MAX_VIRTQUEUES * 2 + 1];
+ uint32_t queue_sel;
+ uint8_t status;
+ struct rte_eth_dev_data *data;
+#endif
Actually I am currently working on this too, and I decided to use a different approach. I made these extra fields into a separate
structure, changed 'io_base' to a pointer, and now I can store a pointer to this extra structure there. The device type can easily be
determined by the (dev->dev_type == RTE_ETH_DEV_PCI) check, so you don't need the VIRTIO_VDEV_IO_BASE magic value.
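For illustration, a minimal sketch of that alternative layout (all names below are made up for the example, not taken from any posted patch):

struct virtio_vdev_extra {
	int type;      /* VHOST_KERNEL or VHOST_USER */
	int vhostfd;   /* /dev/vhost-net fd or vhost-user socket fd */
	/* ... the rest of the vdev-only fields ... */
};

struct virtio_hw_alt {
	void *io;      /* PCI: port base as uintptr_t; vdev: extra state */
	/* ... other fields unchanged ... */
};

static inline struct virtio_vdev_extra *
vdev_extra(struct rte_eth_dev *dev, struct virtio_hw_alt *hw)
{
	/* no VIRTIO_VDEV_IO_BASE sentinel needed: the device type decides */
	return (dev->dev_type == RTE_ETH_DEV_PCI) ? NULL :
			(struct virtio_vdev_extra *)hw->io;
}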
};
/*
@@ -229,6 +244,39 @@ outl_p(unsigned int data, unsigned int port)
#define VIRTIO_PCI_REG_ADDR(hw, reg) \
(unsigned short)((hw)->io_base + (reg))
+#ifdef RTE_VIRTIO_VDEV
+uint32_t virtio_ioport_read(struct virtio_hw *, uint64_t);
+void virtio_ioport_write(struct virtio_hw *, uint64_t, uint32_t);
+
+#define VIRTIO_READ_REG_1(hw, reg) \
+ (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ inb((VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
+ :virtio_ioport_read(hw, reg)
+#define VIRTIO_WRITE_REG_1(hw, reg, value) \
+ (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ outb_p((unsigned char)(value), (VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
+ :virtio_ioport_write(hw, reg, value)
+
+#define VIRTIO_READ_REG_2(hw, reg) \
+ (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ inw((VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
+ :virtio_ioport_read(hw, reg)
+#define VIRTIO_WRITE_REG_2(hw, reg, value) \
+ (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ outw_p((unsigned short)(value), (VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
+ :virtio_ioport_write(hw, reg, value)
+
+#define VIRTIO_READ_REG_4(hw, reg) \
+ (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ inl((VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
+ :virtio_ioport_read(hw, reg)
+#define VIRTIO_WRITE_REG_4(hw, reg, value) \
+ (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ outl_p((unsigned int)(value), (VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
+ :virtio_ioport_write(hw, reg, value)
I also decided to add two fields to 'hw' that store pointers to these accessors. I think this should be faster; however, yes,
this is not performance-critical code because it's executed only during initialization.
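Roughly like this (again only a sketch with made-up names, not the code from either patch):

struct virtio_hw_alt {
	uint32_t (*reg_read)(struct virtio_hw_alt *hw, uint64_t reg);
	void (*reg_write)(struct virtio_hw_alt *hw, uint64_t reg, uint32_t val);
	/* ... */
};

/* assigned once at init time, either to wrappers around inl()/outl_p()
 * or to virtio_ioport_read/write; call sites then carry no
 * PCI-vs-vdev branch */
#define VIRTIO_READ_REG_4(hw, reg)         ((hw)->reg_read((hw), (reg)))
#define VIRTIO_WRITE_REG_4(hw, reg, value) ((hw)->reg_write((hw), (reg), (value)))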
+
+#else /* RTE_VIRTIO_VDEV */
+
#define VIRTIO_READ_REG_1(hw, reg) \
inb((VIRTIO_PCI_REG_ADDR((hw), (reg))))
#define VIRTIO_WRITE_REG_1(hw, reg, value) \
@@ -244,6 +292,8 @@ outl_p(unsigned int data, unsigned int port)
#define VIRTIO_WRITE_REG_4(hw, reg, value) \
outl_p((unsigned int)(value), (VIRTIO_PCI_REG_ADDR((hw), (reg))))
+#endif /* RTE_VIRTIO_VDEV */
+
static inline int
vtpci_with_feature(struct virtio_hw *hw, uint32_t bit)
{
--
2.1.4
Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Pavel Fedin
2016-01-11 14:02:42 UTC
Permalink
Hello! There's one more problem (see inline).
-----Original Message-----
Sent: Sunday, January 10, 2016 2:43 PM
Subject: [PATCH 3/4] virtio/vdev: add ways to interact with vhost
Depends on the type of vhost file: vhost-user is used if the given
path points to a unix socket; vhost-net is used if the given path
points to a char device.
NOTE: we now keep CONFIG_RTE_VIRTIO_VDEV undefined by default; it needs
to be uncommented when in use.
---
config/common_linuxapp | 5 +
drivers/net/virtio/Makefile | 4 +
drivers/net/virtio/vhost.c | 734 +++++++++++++++++++++++++++++++++++++
drivers/net/virtio/vhost.h | 192 ++++++++++
drivers/net/virtio/virtio_ethdev.h | 5 +-
drivers/net/virtio/virtio_pci.h | 52 ++-
6 files changed, 990 insertions(+), 2 deletions(-)
create mode 100644 drivers/net/virtio/vhost.c
create mode 100644 drivers/net/virtio/vhost.h
diff --git a/config/common_linuxapp b/config/common_linuxapp
index 74bc515..f76e162 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -534,3 +534,8 @@ CONFIG_RTE_APP_TEST=y
CONFIG_RTE_TEST_PMD=y
CONFIG_RTE_TEST_PMD_RECORD_CORE_CYCLES=n
CONFIG_RTE_TEST_PMD_RECORD_BURST_STATS=n
+
+#
+# Enable virtio support for container
+#
+CONFIG_RTE_VIRTIO_VDEV=y
diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index 43835ba..0877023 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -52,6 +52,10 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c
+ifeq ($(CONFIG_RTE_VIRTIO_VDEV),y)
+ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += vhost.c
+endif
+
DEPDIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += lib/librte_eal lib/librte_ether
DEPDIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += lib/librte_mempool lib/librte_mbuf
diff --git a/drivers/net/virtio/vhost.c b/drivers/net/virtio/vhost.c
new file mode 100644
index 0000000..e423e02
--- /dev/null
+++ b/drivers/net/virtio/vhost.c
@@ -0,0 +1,734 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdint.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <stdio.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <string.h>
+#include <errno.h>
+#include <assert.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <sys/eventfd.h>
+#include <sys/ioctl.h>
+#include <net/if.h>
+
+#include <rte_mbuf.h>
+#include <rte_memory.h>
+#include <rte_eal_memconfig.h>
+
+#include "virtio_pci.h"
+#include "virtio_logs.h"
+#include "virtio_ethdev.h"
+#include "virtqueue.h"
+#include "vhost.h"
+
+static int
+vhost_user_write(int fd, void *buf, int len, int *fds, int fd_num)
+{
+ struct msghdr msgh;
+ struct iovec iov;
+ int r;
+
+ size_t fd_size = fd_num * sizeof(int);
+ char control[CMSG_SPACE(fd_size)];
+ struct cmsghdr *cmsg;
+
+ memset(&msgh, 0, sizeof(msgh));
+ memset(control, 0, sizeof(control));
+
+ iov.iov_base = (uint8_t *)buf;
+ iov.iov_len = len;
+
+ msgh.msg_iov = &iov;
+ msgh.msg_iovlen = 1;
+
+ msgh.msg_control = control;
+ msgh.msg_controllen = sizeof(control);
+
+ cmsg = CMSG_FIRSTHDR(&msgh);
+
+ cmsg->cmsg_len = CMSG_LEN(fd_size);
+ cmsg->cmsg_level = SOL_SOCKET;
+ cmsg->cmsg_type = SCM_RIGHTS;
+ memcpy(CMSG_DATA(cmsg), fds, fd_size);
+
+ do {
+ r = sendmsg(fd, &msgh, 0);
+ } while (r < 0 && errno == EINTR);
+
+ return r;
+}
+
+static int
+vhost_user_read(int fd, VhostUserMsg *msg)
+{
+ uint32_t valid_flags = VHOST_USER_REPLY_MASK | VHOST_USER_VERSION;
+ int ret, sz_hdr = VHOST_USER_HDR_SIZE, sz_payload;
+
+ ret = recv(fd, (void *)msg, sz_hdr, 0);
+ if (ret < sz_hdr) {
+ PMD_DRV_LOG(ERR, "Failed to recv msg hdr: %d instead of %d.",
+ ret, sz_hdr);
+ goto fail;
+ }
+
+ /* validate msg flags */
+ if (msg->flags != (valid_flags)) {
+ PMD_DRV_LOG(ERR, "Failed to recv msg: flags 0x%x instead of 0x%x.",
+ msg->flags, valid_flags);
+ goto fail;
+ }
+
+ sz_payload = msg->size;
+ if (sz_payload) {
+ ret = recv(fd, (void *)((uint8_t *)msg + sz_hdr), sz_payload, 0);
+ if (ret < sz_payload) {
+ PMD_DRV_LOG(ERR, "Failed to recv msg payload: %d instead of %d.",
+ ret, msg->size);
+ goto fail;
+ }
+ }
+
+ return 0;
+
+fail:
+ return -1;
+}
+
+static VhostUserMsg m __attribute__ ((unused));
+
+static void
+prepare_vhost_memory_user(VhostUserMsg *msg, int fds[])
+{
+ int i, num;
+ struct back_file *huges;
+ struct vhost_memory_region *mr;
+
+ num = rte_eal_get_backfile_info(&huges);
+
+ if (num > VHOST_MEMORY_MAX_NREGIONS)
+ rte_panic("%d hugepage files exceed the maximum of %d for "
+ "vhost-user\n", num, VHOST_MEMORY_MAX_NREGIONS);
+
+ for (i = 0; i < num; ++i) {
+ mr = &msg->payload.memory.regions[i];
+ mr->guest_phys_addr = (uint64_t)huges[i].addr; /* use vaddr! */
+ mr->userspace_addr = (uint64_t)huges[i].addr;
+ mr->memory_size = huges[i].size;
+ mr->mmap_offset = 0;
+ fds[i] = open(huges[i].filepath, O_RDWR);
+ }
+
+ msg->payload.memory.nregions = num;
+ msg->payload.memory.padding = 0;
+ free(huges);
+}
+
+static int
+vhost_user_sock(struct virtio_hw *hw, unsigned long int req, void *arg)
+{
+ VhostUserMsg msg;
+ struct vhost_vring_file *file = 0;
+ int need_reply = 0;
+ int fds[VHOST_MEMORY_MAX_NREGIONS];
+ int fd_num = 0;
+ int i, len;
+
+ msg.request = req;
+ msg.flags = VHOST_USER_VERSION;
+ msg.size = 0;
+
+ switch (req) {
+ case VHOST_USER_GET_FEATURES:
+ need_reply = 1;
+ break;
+
+ case VHOST_USER_SET_FEATURES:
+ case VHOST_USER_SET_LOG_BASE:
+ msg.payload.u64 = *((__u64 *)arg);
+ msg.size = sizeof(m.payload.u64);
+ break;
+
+ case VHOST_USER_SET_OWNER:
+ case VHOST_USER_RESET_OWNER:
+ break;
+
+ case VHOST_USER_SET_MEM_TABLE:
+ prepare_vhost_memory_user(&msg, fds);
+ fd_num = msg.payload.memory.nregions;
+ msg.size = sizeof(m.payload.memory.nregions);
+ msg.size += sizeof(m.payload.memory.padding);
+ msg.size += fd_num * sizeof(struct vhost_memory_region);
+ break;
+
+ case VHOST_USER_SET_LOG_FD:
+ fds[fd_num++] = *((int *)arg);
+ break;
+
+ case VHOST_USER_SET_VRING_NUM:
+ case VHOST_USER_SET_VRING_BASE:
+ memcpy(&msg.payload.state, arg, sizeof(struct vhost_vring_state));
+ msg.size = sizeof(m.payload.state);
+ break;
+
+ case VHOST_USER_GET_VRING_BASE:
+ memcpy(&msg.payload.state, arg, sizeof(struct vhost_vring_state));
+ msg.size = sizeof(m.payload.state);
+ need_reply = 1;
+ break;
+
+ case VHOST_USER_SET_VRING_ADDR:
+ memcpy(&msg.payload.addr, arg, sizeof(struct vhost_vring_addr));
+ msg.size = sizeof(m.payload.addr);
+ break;
+
+ case VHOST_USER_SET_VRING_KICK:
+ case VHOST_USER_SET_VRING_CALL:
+ case VHOST_USER_SET_VRING_ERR:
+ file = arg;
+ msg.payload.u64 = file->index & VHOST_USER_VRING_IDX_MASK;
+ msg.size = sizeof(m.payload.u64);
+ if (file->fd > 0)
+ fds[fd_num++] = file->fd;
+ else
+ msg.payload.u64 |= VHOST_USER_VRING_NOFD_MASK;
+ break;
+
+ default:
+ PMD_DRV_LOG(ERR, "vhost-user trying to send unhandled msg type");
+ return -1;
+ }
+
+ len = VHOST_USER_HDR_SIZE + msg.size;
+ if (vhost_user_write(hw->vhostfd, &msg, len, fds, fd_num) < 0)
+ return -1;
+
+ if (req == VHOST_USER_SET_MEM_TABLE)
+ for (i = 0; i < fd_num; ++i)
+ close(fds[i]);
+
+ if (need_reply) {
+ if (vhost_user_read(hw->vhostfd, &msg) < 0)
+ return -1;
+
+ if (req != msg.request) {
+ PMD_DRV_LOG(ERR, "Received unexpected msg type.");
+ return -1;
+ }
+
+ switch (req) {
+ case VHOST_USER_GET_FEATURES:
+ if (msg.size != sizeof(m.payload.u64)) {
+ PMD_DRV_LOG(ERR, "Received bad msg size.");
+ return -1;
+ }
+ *((__u64 *)arg) = msg.payload.u64;
+ break;
+ case VHOST_USER_GET_VRING_BASE:
+ if (msg.size != sizeof(m.payload.state)) {
+ PMD_DRV_LOG(ERR, "Received bad msg size.");
+ return -1;
+ }
+ memcpy(arg, &msg.payload.state, sizeof(struct vhost_vring_state));
+ break;
+ default:
+ PMD_DRV_LOG(ERR, "Received unexpected msg type.");
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
+static int
+vhost_kernel_ioctl(struct virtio_hw *hw, unsigned long int req, void *arg)
+{
+ return ioctl(hw->vhostfd, req, arg);
+}
+
+enum {
+ VHOST_MSG_SET_OWNER,
+ VHOST_MSG_SET_FEATURES,
+ VHOST_MSG_GET_FEATURES,
+ VHOST_MSG_SET_VRING_CALL,
+ VHOST_MSG_SET_VRING_NUM,
+ VHOST_MSG_SET_VRING_BASE,
+ VHOST_MSG_SET_VRING_ADDR,
+ VHOST_MSG_SET_VRING_KICK,
+ VHOST_MSG_SET_MEM_TABLE,
+ VHOST_MSG_MAX,
+};
+
+static const char *vhost_msg_strings[] = {
+ "VHOST_MSG_SET_OWNER",
+ "VHOST_MSG_SET_FEATURES",
+ "VHOST_MSG_GET_FEATURES",
+ "VHOST_MSG_SET_VRING_CALL",
+ "VHOST_MSG_SET_VRING_NUM",
+ "VHOST_MSG_SET_VRING_BASE",
+ "VHOST_MSG_SET_VRING_ADDR",
+ "VHOST_MSG_SET_VRING_KICK",
+ "VHOST_MSG_SET_MEM_TABLE",
+ NULL,
+};
+
+static unsigned long int vhost_req_map[][2] = {
+ {VHOST_SET_OWNER, VHOST_USER_SET_OWNER},
+ {VHOST_SET_FEATURES, VHOST_USER_SET_FEATURES},
+ {VHOST_GET_FEATURES, VHOST_USER_GET_FEATURES},
+ {VHOST_SET_VRING_CALL, VHOST_USER_SET_VRING_CALL},
+ {VHOST_SET_VRING_NUM, VHOST_USER_SET_VRING_NUM},
+ {VHOST_SET_VRING_BASE, VHOST_USER_SET_VRING_BASE},
+ {VHOST_SET_VRING_ADDR, VHOST_USER_SET_VRING_ADDR},
+ {VHOST_SET_VRING_KICK, VHOST_USER_SET_VRING_KICK},
+ {VHOST_SET_MEM_TABLE, VHOST_USER_SET_MEM_TABLE},
+};
+
+static int
+vhost_call(struct virtio_hw *hw, unsigned long int req_orig, void *arg)
+{
+ int req_new;
+ int ret = 0;
+
+ if (req_orig >= VHOST_MSG_MAX)
+ rte_panic("invalid req: %lu\n", req_orig);
+
+ PMD_DRV_LOG(INFO, "%s\n", vhost_msg_strings[req_orig]);
+ req_new = vhost_req_map[req_orig][hw->type];
+ if (hw->type == VHOST_USER)
+ ret = vhost_user_sock(hw, req_new, arg);
+ else
+ ret = vhost_kernel_ioctl(hw, req_new, arg);
+
+ if (ret < 0)
+ rte_panic("vhost_call %s failed: %s\n",
+ vhost_msg_strings[req_orig], strerror(errno));
+
+ return ret;
+}
+
+static void
+kick_one_vq(struct virtio_hw *hw, struct virtqueue *vq, unsigned queue_sel)
+{
+ int callfd, kickfd;
+ struct vhost_vring_file file;
+
+ /* One could use an invalid fd to disable it, but vhost-dpdk uses this
+ * to judge if the dev is alive. So finally we need two real eventfds.
+ */
+ /* Of all per virtqueue MSGs, make sure VHOST_SET_VRING_CALL comes
+ * firstly because vhost depends on this msg to allocate virtqueue
+ * pair.
+ */
+ callfd = eventfd(0, O_CLOEXEC | O_NONBLOCK);
+ if (callfd < 0)
+ rte_panic("callfd error, %s\n", strerror(errno));
+
+ file.index = queue_sel;
+ file.fd = callfd;
+ vhost_call(hw, VHOST_MSG_SET_VRING_CALL, &file);
+ hw->callfds[queue_sel] = callfd;
+
+ struct vhost_vring_state state;
+ state.index = queue_sel;
+ state.num = vq->vq_ring.num;
+ vhost_call(hw, VHOST_MSG_SET_VRING_NUM, &state);
+
+ state.num = 0; /* no reservation */
+ vhost_call(hw, VHOST_MSG_SET_VRING_BASE, &state);
+
+ struct vhost_vring_addr addr = {
+ .index = queue_sel,
+ .desc_user_addr = (uint64_t)vq->vq_ring.desc,
+ .avail_user_addr = (uint64_t)vq->vq_ring.avail,
+ .used_user_addr = (uint64_t)vq->vq_ring.used,
+ .log_guest_addr = 0,
+ .flags = 0, /* disable log */
+ };
+ vhost_call(hw, VHOST_MSG_SET_VRING_ADDR, &addr);
+
+ /* Of all per virtqueue MSGs, make sure VHOST_SET_VRING_KICK comes
+ * lastly because vhost depends on this msg to judge if
+ * virtio_is_ready().
+ */
+
+ kickfd = eventfd(0, O_CLOEXEC | O_NONBLOCK);
+ if (kickfd < 0)
+ rte_panic("kickfd error, %s\n", strerror(errno));
+
+ file.fd = kickfd;
+ vhost_call(hw, VHOST_MSG_SET_VRING_KICK, &file);
+ hw->kickfds[queue_sel] = kickfd;
+}
+
+/**
+ * Merge those virtually adjacent memsegs into one region.
+ */
+static void
+prepare_vhost_memory_kernel(struct vhost_memory_kernel **p_vm)
+{
+ unsigned i, j, k = 0;
+ struct rte_memseg *seg;
+ struct vhost_memory_region *mr;
+ struct vhost_memory_kernel *vm;
+
+ vm = malloc(sizeof(struct vhost_memory_kernel)
+ + RTE_MAX_MEMSEG * sizeof(struct vhost_memory_region));
+
+ for (i = 0; i < RTE_MAX_MEMSEG; ++i) {
+ seg = &rte_eal_get_configuration()->mem_config->memseg[i];
+ if (seg->addr == NULL)
+ break;
+
+ int new_region = 1;
+ for (j = 0; j < k; ++j) {
+ mr = &vm->regions[j];
+
+ if (mr->userspace_addr + mr->memory_size
+ == (uint64_t)seg->addr) {
+ mr->memory_size += seg->len;
+ new_region = 0;
+ break;
+ }
+
+ if ((uint64_t)seg->addr + seg->len
+ == mr->userspace_addr) {
+ mr->guest_phys_addr = (uint64_t)seg->addr;
+ mr->userspace_addr = (uint64_t)seg->addr;
+ mr->memory_size += seg->len;
+ new_region = 0;
+ break;
+ }
+ }
+
+ if (new_region == 0)
+ continue;
+
+ mr = &vm->regions[k++];
+ mr->guest_phys_addr = (uint64_t)seg->addr; /* use vaddr here! */
+ mr->userspace_addr = (uint64_t)seg->addr;
+ mr->memory_size = seg->len;
+ mr->mmap_offset = 0;
+ }
+
+ vm->nregions = k;
+ vm->padding = 0;
+ *p_vm = vm;
+}
+
+static void kick_all_vq(struct virtio_hw *hw)
+{
+ int ret;
+ unsigned i, queue_sel, nvqs;
+ struct rte_eth_dev_data *data = hw->data;
+
+ if (hw->type == VHOST_KERNEL) {
+ struct vhost_memory_kernel *vm = NULL;
+ prepare_vhost_memory_kernel(&vm);
+ vhost_call(hw, VHOST_MSG_SET_MEM_TABLE, vm);
+ free(vm);
+ } else {
+ /* construct vhost_memory inside prepare_vhost_memory_user() */
+ vhost_call(hw, VHOST_MSG_SET_MEM_TABLE, NULL);
+ }
+
+ for (i = 0; i < data->nb_rx_queues; ++i) {
+ queue_sel = 2 * i + VTNET_SQ_RQ_QUEUE_IDX;
+ kick_one_vq(hw, data->rx_queues[i], queue_sel);
+ }
+ for (i = 0; i < data->nb_tx_queues; ++i) {
+ queue_sel = 2 * i + VTNET_SQ_TQ_QUEUE_IDX;
+ kick_one_vq(hw, data->tx_queues[i], queue_sel);
+ }
+
+ /* after setting up all virtqueues, we need to set_features again
+ * so that these features can be set into each virtqueue on
+ * the vhost side.
+ */
+ uint64_t features = hw->guest_features;
+ features &= ~(1ull << VIRTIO_NET_F_MAC);
+ vhost_call(hw, VHOST_MSG_SET_FEATURES, &features);
+ if (ioctl(hw->backfd, TUNSETVNETHDRSZ, &hw->vtnet_hdr_size) == -1)
+ rte_panic("TUNSETVNETHDRSZ failed: %s\n", strerror(errno));
This is a bug. With VHOST_USER, hw->backfd is not initialized (I suppose it contains 0), and attempting this ioctl on stdin does nothing good.
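A minimal fix (sketch, untested) would be to guard the tap ioctl on the backend type, or simply move the call into the existing if (hw->type == VHOST_KERNEL) block a few lines below:

	if (hw->type == VHOST_KERNEL &&
	    ioctl(hw->backfd, TUNSETVNETHDRSZ, &hw->vtnet_hdr_size) == -1)
		rte_panic("TUNSETVNETHDRSZ failed: %s\n", strerror(errno));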
+ PMD_DRV_LOG(INFO, "set features:%"PRIx64"\n", features);
+
+ if (hw->type == VHOST_KERNEL) {
+ struct vhost_vring_file file;
+
+ file.fd = hw->backfd;
+ nvqs = data->nb_rx_queues + data->nb_tx_queues;
+ for (file.index = 0; file.index < nvqs; ++file.index) {
+ ret = vhost_kernel_ioctl(hw, VHOST_NET_SET_BACKEND, &file);
+ if (ret < 0)
+ rte_panic("VHOST_NET_SET_BACKEND failed, %s\n",
+ strerror(errno));
+ }
+ }
+
+ /* TODO: VHOST_SET_LOG_BASE */
+}
+
+void
+virtio_ioport_write(struct virtio_hw *hw, uint64_t addr, uint32_t val)
+{
+ uint64_t guest_features;
+
+ switch (addr) {
+ case VIRTIO_PCI_GUEST_FEATURES:
+ guest_features = val;
+ guest_features &= ~(1ull << VIRTIO_NET_F_MAC);
+ vhost_call(hw, VHOST_MSG_SET_FEATURES, &guest_features);
+ break;
+ case VIRTIO_PCI_QUEUE_PFN:
+ /* do nothing */
+ break;
+ case VIRTIO_PCI_QUEUE_SEL:
+ hw->queue_sel = val;
+ break;
+ case VIRTIO_PCI_STATUS:
+ if (val & VIRTIO_CONFIG_S_DRIVER_OK)
+ kick_all_vq(hw);
+ hw->status = val & 0xFF;
+ break;
+ case VIRTIO_PCI_QUEUE_NOTIFY:
+ {
+ int ret;
+ uint64_t buf = 1;
+ ret = write(hw->kickfds[val], &buf, sizeof(uint64_t));
+ if (ret == -1)
+ rte_panic("VIRTIO_PCI_QUEUE_NOTIFY failed: %s\n",
+ strerror(errno));
+ break;
+ }
+ default:
+ PMD_DRV_LOG(ERR, "unexpected address %"PRIu64" value 0x%x\n",
+ addr, val);
+ break;
+ }
+}
+
+uint32_t
+virtio_ioport_read(struct virtio_hw *hw, uint64_t addr)
+{
+ uint32_t ret = 0xFFFFFFFF;
+ uint64_t host_features;
+
+ PMD_DRV_LOG(INFO, "addr: %"PRIu64"\n", addr);
+
+ switch (addr) {
+ case VIRTIO_PCI_HOST_FEATURES:
+ vhost_call(hw, VHOST_MSG_GET_FEATURES, &host_features);
+ PMD_DRV_LOG(INFO, "get_features: %"PRIx64"\n", host_features);
+ if (hw->mac_specified)
+ host_features |= (1ull << VIRTIO_NET_F_MAC);
+ /* disable it until we support CQ */
+ host_features &= ~(1ull << VIRTIO_NET_F_CTRL_RX);
+ ret = host_features;
+ break;
+ case VIRTIO_PCI_GUEST_FEATURES:
+ ret = hw->guest_features;
+ break;
+ case VIRTIO_PCI_QUEUE_NUM:
+ ret = hw->queue_num;
+ break;
+ case VIRTIO_PCI_QUEUE_SEL:
+ ret = hw->queue_sel;
+ break;
+ case VIRTIO_PCI_STATUS:
+ ret = hw->status;
+ break;
+ case 20: /* mac addr: 0~3 */
+ if (hw->mac_specified) {
+ uint32_t m0 = hw->mac_addr[0],
+ m1 = hw->mac_addr[1],
+ m2 = hw->mac_addr[2],
+ m3 = hw->mac_addr[3];
+ ret = (m3 << 24) | (m2 << 16) | (m1 << 8) | m0;
+ }
+ break;
+ case 24: /* mac addr: 4~5 */
+ if (hw->mac_specified) {
+ uint32_t m4 = hw->mac_addr[4],
+ m5 = hw->mac_addr[5];
+ ret = (m5 << 8) | m4;
+ }
+ break;
+ default:
+ PMD_DRV_LOG(ERR, "%"PRIu64" (r) not supported\n", addr);
+ break;
+ }
+
+ return ret;
+}
+
+#define TUN_DEF_SNDBUF (1ull << 20)
+
+static void
+vhost_kernel_backend_setup(struct virtio_hw *hw)
+{
+ int fd;
+ int len = sizeof(struct virtio_net_hdr);
+ int req_mq = 0;
+ int sndbuf = TUN_DEF_SNDBUF;
+ unsigned int features;
+ struct ifreq ifr;
+
+ /* TODO:
+ * 1. get and set offload capability, tap_probe_has_ufo, tap_fd_set_offload
+ * 2. verify we can get and set vnet_hdr_len, tap_probe_vnet_hdr_len
+ */
+
+ /* TODO:
+ * 1. get number of memory regions from vhost module parameter
+ * max_mem_regions, supported in newer version linux kernel
+ */
+
+ fd = open(PATH_NET_TUN, O_RDWR);
+ if (fd < 0)
+ rte_panic("open %s error, %s\n", PATH_NET_TUN, strerror(errno));
+
+ memset(&ifr, 0, sizeof(ifr));
+ ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
+
+ if (ioctl(fd, TUNGETFEATURES, &features) == -1)
+ rte_panic("TUNGETFEATURES failed: %s", strerror(errno));
+
+ if (features & IFF_ONE_QUEUE)
+ ifr.ifr_flags |= IFF_ONE_QUEUE;
+
+ if (features & IFF_VNET_HDR)
+ ifr.ifr_flags |= IFF_VNET_HDR;
+ else
+ rte_panic("vnet_hdr requested, but kernel does not support\n");
+
+ if (req_mq) {
+ if (features & IFF_MULTI_QUEUE)
+ ifr.ifr_flags |= IFF_MULTI_QUEUE;
+ else
+ rte_panic("multiqueue requested, but kernel does not support\n");
+ }
+
+ strncpy(ifr.ifr_name, "tap%d", IFNAMSIZ);
+ if (ioctl(fd, TUNSETIFF, (void *) &ifr) == -1)
+ rte_panic("TUNSETIFF failed: %s", strerror(errno));
+ fcntl(fd, F_SETFL, O_NONBLOCK);
+
+ if (ioctl(fd, TUNSETVNETHDRSZ, &len) == -1)
+ rte_panic("TUNSETVNETHDRSZ failed: %s\n", strerror(errno));
+
+ if (ioctl(fd, TUNSETSNDBUF, &sndbuf) == -1)
+ rte_panic("TUNSETSNDBUF failed: %s", strerror(errno));
+
+ hw->backfd = fd;
+
+ hw->vhostfd = open(hw->path, O_RDWR);
+ if (hw->vhostfd == -1)
+ rte_panic("open %s failed: %s\n", hw->path, strerror(errno));
+}
+
+static void
+vhost_user_backend_setup(struct virtio_hw *hw)
+{
+ int fd;
+ int flag;
+ struct sockaddr_un un;
+
+ fd = socket(AF_UNIX, SOCK_STREAM, 0);
+ if (fd < 0)
+ rte_panic("socket error, %s\n", strerror(errno));
+
+ flag = fcntl(fd, F_GETFD);
+ fcntl(fd, F_SETFD, flag | FD_CLOEXEC);
+
+ memset(&un, 0, sizeof(un));
+ un.sun_family = AF_UNIX;
+ snprintf(un.sun_path, sizeof(un.sun_path), "%s", hw->path);
+ if (connect(fd, (struct sockaddr *)&un, sizeof(un)) < 0) {
+ PMD_DRV_LOG(ERR, "connect error, %s\n", strerror(errno));
+ exit(-1);
+ }
+
+ hw->vhostfd = fd;
+}
+
+void
+virtio_vdev_init(struct rte_eth_dev_data *data, const char *path,
+ int nb_rx, int nb_tx, int nb_cq __attribute__ ((unused)),
+ int queue_num, char *mac)
+{
+ int i;
+ int ret;
+ struct stat s;
+ uint32_t tmp[ETHER_ADDR_LEN];
+ struct virtio_hw *hw = data->dev_private;
+
+ hw->data = data;
+ hw->path = strdup(path);
+ hw->max_rx_queues = nb_rx;
+ hw->max_tx_queues = nb_tx;
+ hw->queue_num = queue_num;
+ hw->mac_specified = 0;
+ if (mac) {
+ ret = sscanf(mac, "%x:%x:%x:%x:%x:%x", &tmp[0], &tmp[1],
+ &tmp[2], &tmp[3], &tmp[4], &tmp[5]);
+ if (ret == ETHER_ADDR_LEN) {
+ for (i = 0; i < ETHER_ADDR_LEN; ++i)
+ hw->mac_addr[i] = (uint8_t)tmp[i];
+ hw->mac_specified = 1;
+ }
+ }
+
+ /* TODO: cq */
+
+ ret = stat(hw->path, &s);
+ if (ret < 0)
+ rte_panic("stat: %s failed, %s\n", hw->path, strerror(errno));
+
+ switch (s.st_mode & S_IFMT) {
+ case S_IFCHR:
+ hw->type = VHOST_KERNEL;
+ vhost_kernel_backend_setup(hw);
+ break;
+ case S_IFSOCK:
+ hw->type = VHOST_USER;
+ vhost_user_backend_setup(hw);
+ break;
+ default:
+ rte_panic("unknown file type of %s\n", hw->path);
+ }
+ if (vhost_call(hw, VHOST_MSG_SET_OWNER, NULL) == -1)
+ rte_panic("vhost set_owner failed: %s\n", strerror(errno));
+}
diff --git a/drivers/net/virtio/vhost.h b/drivers/net/virtio/vhost.h
new file mode 100644
index 0000000..c7517f6
--- /dev/null
+++ b/drivers/net/virtio/vhost.h
@@ -0,0 +1,192 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _VHOST_NET_USER_H
+#define _VHOST_NET_USER_H
+
+#include <stdint.h>
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define VHOST_MEMORY_MAX_NREGIONS 8
+
+struct vhost_vring_state {
+ unsigned int index;
+ unsigned int num;
+};
+
+struct vhost_vring_file {
+ unsigned int index;
+ int fd;
+};
+
+struct vhost_vring_addr {
+ unsigned int index;
+ /* Option flags. */
+ unsigned int flags;
+ /* Flag values: */
+ /* Whether log address is valid. If set enables logging. */
+#define VHOST_VRING_F_LOG 0
+
+ /* Start of array of descriptors (virtually contiguous) */
+ uint64_t desc_user_addr;
+ /* Used structure address. Must be 32 bit aligned */
+ uint64_t used_user_addr;
+ /* Available structure address. Must be 16 bit aligned */
+ uint64_t avail_user_addr;
+ /* Logging support. */
+ /* Log writes to used structure, at offset calculated from specified
+ * address. Address must be 32 bit aligned. */
+ uint64_t log_guest_addr;
+};
+
+#define VIRTIO_CONFIG_S_DRIVER_OK 4
+
+typedef enum VhostUserRequest {
+ VHOST_USER_NONE = 0,
+ VHOST_USER_GET_FEATURES = 1,
+ VHOST_USER_SET_FEATURES = 2,
+ VHOST_USER_SET_OWNER = 3,
+ VHOST_USER_RESET_OWNER = 4,
+ VHOST_USER_SET_MEM_TABLE = 5,
+ VHOST_USER_SET_LOG_BASE = 6,
+ VHOST_USER_SET_LOG_FD = 7,
+ VHOST_USER_SET_VRING_NUM = 8,
+ VHOST_USER_SET_VRING_ADDR = 9,
+ VHOST_USER_SET_VRING_BASE = 10,
+ VHOST_USER_GET_VRING_BASE = 11,
+ VHOST_USER_SET_VRING_KICK = 12,
+ VHOST_USER_SET_VRING_CALL = 13,
+ VHOST_USER_SET_VRING_ERR = 14,
+ VHOST_USER_GET_PROTOCOL_FEATURES = 15,
+ VHOST_USER_SET_PROTOCOL_FEATURES = 16,
+ VHOST_USER_GET_QUEUE_NUM = 17,
+ VHOST_USER_SET_VRING_ENABLE = 18,
+ VHOST_USER_MAX
+} VhostUserRequest;
+
+struct vhost_memory_region {
+ uint64_t guest_phys_addr;
+ uint64_t memory_size; /* bytes */
+ uint64_t userspace_addr;
+ uint64_t mmap_offset;
+};
+struct vhost_memory_kernel {
+ uint32_t nregions;
+ uint32_t padding;
+ struct vhost_memory_region regions[0];
+};
+
+struct vhost_memory {
+ uint32_t nregions;
+ uint32_t padding;
+ struct vhost_memory_region regions[VHOST_MEMORY_MAX_NREGIONS];
+};
+
+typedef struct VhostUserMsg {
+ VhostUserRequest request;
+
+#define VHOST_USER_VERSION_MASK 0x3
+#define VHOST_USER_REPLY_MASK (0x1 << 2)
+ uint32_t flags;
+ uint32_t size; /* the following payload size */
+ union {
+#define VHOST_USER_VRING_IDX_MASK 0xff
+#define VHOST_USER_VRING_NOFD_MASK (0x1 << 8)
+ uint64_t u64;
+ struct vhost_vring_state state;
+ struct vhost_vring_addr addr;
+ struct vhost_memory memory;
+ } payload;
+ int fds[VHOST_MEMORY_MAX_NREGIONS];
+} __attribute((packed)) VhostUserMsg;
+
+#define VHOST_USER_HDR_SIZE offsetof(VhostUserMsg, payload.u64)
+#define VHOST_USER_PAYLOAD_SIZE (sizeof(VhostUserMsg) - VHOST_USER_HDR_SIZE)
+
+/* The version of the protocol we support */
+#define VHOST_USER_VERSION 0x1
+
+/* ioctls */
+
+#define VHOST_VIRTIO 0xAF
+
+#define VHOST_GET_FEATURES _IOR(VHOST_VIRTIO, 0x00, __u64)
+#define VHOST_SET_FEATURES _IOW(VHOST_VIRTIO, 0x00, __u64)
+#define VHOST_SET_OWNER _IO(VHOST_VIRTIO, 0x01)
+#define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02)
+#define VHOST_SET_MEM_TABLE _IOW(VHOST_VIRTIO, 0x03, struct vhost_memory_kernel)
+#define VHOST_SET_LOG_BASE _IOW(VHOST_VIRTIO, 0x04, __u64)
+#define VHOST_SET_LOG_FD _IOW(VHOST_VIRTIO, 0x07, int)
+#define VHOST_SET_VRING_NUM _IOW(VHOST_VIRTIO, 0x10, struct vhost_vring_state)
+#define VHOST_SET_VRING_ADDR _IOW(VHOST_VIRTIO, 0x11, struct vhost_vring_addr)
+#define VHOST_SET_VRING_BASE _IOW(VHOST_VIRTIO, 0x12, struct vhost_vring_state)
+#define VHOST_GET_VRING_BASE _IOWR(VHOST_VIRTIO, 0x12, struct vhost_vring_state)
+#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_vring_file)
+#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_vring_file)
+#define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
+#define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file)
+
+/*****************************************************************************/
+
+/* Ioctl defines */
+#define TUNSETIFF _IOW('T', 202, int)
+#define TUNGETFEATURES _IOR('T', 207, unsigned int)
+#define TUNSETOFFLOAD _IOW('T', 208, unsigned int)
+#define TUNGETIFF _IOR('T', 210, unsigned int)
+#define TUNSETSNDBUF _IOW('T', 212, int)
+#define TUNGETVNETHDRSZ _IOR('T', 215, int)
+#define TUNSETVNETHDRSZ _IOW('T', 216, int)
+#define TUNSETQUEUE _IOW('T', 217, int)
+#define TUNSETVNETLE _IOW('T', 220, int)
+#define TUNSETVNETBE _IOW('T', 222, int)
+
+/* TUNSETIFF ifr flags */
+#define IFF_TAP 0x0002
+#define IFF_NO_PI 0x1000
+#define IFF_ONE_QUEUE 0x2000
+#define IFF_VNET_HDR 0x4000
+#define IFF_MULTI_QUEUE 0x0100
+#define IFF_ATTACH_QUEUE 0x0200
+#define IFF_DETACH_QUEUE 0x0400
+
+/* Features for GSO (TUNSETOFFLOAD). */
+#define TUN_F_CSUM 0x01 /* You can hand me unchecksummed packets. */
+#define TUN_F_TSO4 0x02 /* I can handle TSO for IPv4 packets */
+#define TUN_F_TSO6 0x04 /* I can handle TSO for IPv6 packets */
+#define TUN_F_TSO_ECN 0x08 /* I can handle TSO with ECN bits. */
+#define TUN_F_UFO 0x10 /* I can handle UFO packets */
+
+#define PATH_NET_TUN "/dev/net/tun"
+
+#endif
diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index ae2d47d..9e1ecb3 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -122,5 +122,8 @@ uint16_t virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf
**tx_pkts,
#define VTNET_LRO_FEATURES (VIRTIO_NET_F_GUEST_TSO4 | \
VIRTIO_NET_F_GUEST_TSO6 | VIRTIO_NET_F_GUEST_ECN)
-
+#ifdef RTE_VIRTIO_VDEV
+void virtio_vdev_init(struct rte_eth_dev_data *data, const char *path,
+ int nb_rx, int nb_tx, int nb_cq, int queue_num, char *mac);
+#endif
#endif /* _VIRTIO_ETHDEV_H_ */
diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
index 47f722a..af05ae2 100644
--- a/drivers/net/virtio/virtio_pci.h
+++ b/drivers/net/virtio/virtio_pci.h
@@ -147,7 +147,6 @@ struct virtqueue;
* rest are per-device feature bits.
*/
#define VIRTIO_TRANSPORT_F_START 28
-#define VIRTIO_TRANSPORT_F_END 32
/* The Guest publishes the used index for which it expects an interrupt
* at the end of the avail ring. Host should ignore the avail->flags field. */
@@ -165,6 +164,7 @@ struct virtqueue;
struct virtio_hw {
struct virtqueue *cvq;
+#define VIRTIO_VDEV_IO_BASE 0xffffffff
uint32_t io_base;
uint32_t guest_features;
uint32_t max_tx_queues;
@@ -174,6 +174,21 @@ struct virtio_hw {
uint8_t use_msix;
uint8_t started;
uint8_t mac_addr[ETHER_ADDR_LEN];
+#ifdef RTE_VIRTIO_VDEV
+#define VHOST_KERNEL 0
+#define VHOST_USER 1
+ int type; /* type of backend */
+ uint32_t queue_num;
+ char *path;
+ int mac_specified;
+ int vhostfd;
+ int backfd; /* tap device used in vhost-net */
+ int callfds[VIRTIO_MAX_VIRTQUEUES * 2 + 1];
+ int kickfds[VIRTIO_MAX_VIRTQUEUES * 2 + 1];
+ uint32_t queue_sel;
+ uint8_t status;
+ struct rte_eth_dev_data *data;
+#endif
};
/*
@@ -229,6 +244,39 @@ outl_p(unsigned int data, unsigned int port)
#define VIRTIO_PCI_REG_ADDR(hw, reg) \
(unsigned short)((hw)->io_base + (reg))
+#ifdef RTE_VIRTIO_VDEV
+uint32_t virtio_ioport_read(struct virtio_hw *, uint64_t);
+void virtio_ioport_write(struct virtio_hw *, uint64_t, uint32_t);
+
+#define VIRTIO_READ_REG_1(hw, reg) \
+ (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ inb((VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
+ :virtio_ioport_read(hw, reg)
+#define VIRTIO_WRITE_REG_1(hw, reg, value) \
+ (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ outb_p((unsigned char)(value), (VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
+ :virtio_ioport_write(hw, reg, value)
+
+#define VIRTIO_READ_REG_2(hw, reg) \
+ (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ inw((VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
+ :virtio_ioport_read(hw, reg)
+#define VIRTIO_WRITE_REG_2(hw, reg, value) \
+ (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ outw_p((unsigned short)(value), (VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
+ :virtio_ioport_write(hw, reg, value)
+
+#define VIRTIO_READ_REG_4(hw, reg) \
+ (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ inl((VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
+ :virtio_ioport_read(hw, reg)
+#define VIRTIO_WRITE_REG_4(hw, reg, value) \
+ (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ outl_p((unsigned int)(value), (VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
+ :virtio_ioport_write(hw, reg, value)
+
+#else /* RTE_VIRTIO_VDEV */
+
#define VIRTIO_READ_REG_1(hw, reg) \
inb((VIRTIO_PCI_REG_ADDR((hw), (reg))))
#define VIRTIO_WRITE_REG_1(hw, reg, value) \
@@ -244,6 +292,8 @@ outl_p(unsigned int data, unsigned int port)
#define VIRTIO_WRITE_REG_4(hw, reg, value) \
outl_p((unsigned int)(value), (VIRTIO_PCI_REG_ADDR((hw), (reg))))
+#endif /* RTE_VIRTIO_VDEV */
+
static inline int
vtpci_with_feature(struct virtio_hw *hw, uint32_t bit)
{
--
2.1.4
Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Xie, Huawei
2016-01-21 02:18:19 UTC
Permalink
Post by Jianfeng Tan
+ if (hw->type == VHOST_KERNEL) {
+ struct vhost_vring_file file;
+
+ file.fd = hw->backfd;
+ nvqs = data->nb_rx_queues + data->nb_tx_queues;
+ for (file.index = 0; file.index < nvqs; ++file.index) {
+ ret = vhost_kernel_ioctl(hw, VHOST_NET_SET_BACKEND, &file);
+ if (ret < 0)
+ rte_panic("VHOST_NET_SET_BACKEND failed, %s\n",
+ strerror(errno));
+ }
+ }
+
+ /* TODO: VHOST_SET_LOG_BASE */
We needn't support VHOST_SET_LOG_BASE.
Jianfeng Tan
2016-01-10 11:43:02 UTC
Permalink
Add a new virtual device named eth_cvio; it can be used just like
eth_ring, eth_null, etc.

Configured parameters include:
- rx (optional, 1 by default): number of rx queues, only allowed to be
1 for now.
- tx (optional, 1 by default): number of tx queues, only allowed to be
1 for now.
- cq (optional, 0 by default): whether the ctrl queue is enabled, not
supported for now.
- mac (optional): mac address, a random value will be used if not
specified.
- queue_num (optional, 256 by default): size of the virtqueue.
- path (mandatory): path of vhost, depends on the file type:
vhost-user is used if the given path points to
a unix socket; vhost-net is used if the given
path points to a char device.

The major difference from the original virtio for VMs is that here we
use virtual addresses instead of physical addresses for vhost to
calculate relative addresses.

When CONFIG_RTE_VIRTIO_VDEV is enabled (the default), the compiled
library can be used in both VM and container environments.

Examples:
a. Use vhost-net as a backend
sudo numactl -N 1 -m 1 ./examples/l2fwd/build/l2fwd -c 0x100000 -n 4 \
-m 1024 --no-pci --single-file --file-prefix=l2fwd \
--vdev=eth_cvio0,mac=00:01:02:03:04:05,path=/dev/vhost-net \
-- -p 0x1

b. Use vhost-user as a backend
numactl -N 1 -m 1 ./examples/l2fwd/build/l2fwd -c 0x100000 -n 4 -m 1024 \
--no-pci --single-file --file-prefix=l2fwd \
--vdev=eth_cvio0,mac=00:01:02:03:04:05,path=<path_to_vhost_user> \
-- -p 0x1

Signed-off-by: Huawei Xie <***@intel.com>
Signed-off-by: Jianfeng Tan <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.c | 338 +++++++++++++++++++++++++-------
drivers/net/virtio/virtio_ethdev.h | 1 +
drivers/net/virtio/virtio_pci.h | 24 +--
drivers/net/virtio/virtio_rxtx.c | 11 +-
drivers/net/virtio/virtio_rxtx_simple.c | 14 +-
drivers/net/virtio/virtqueue.h | 13 +-
6 files changed, 302 insertions(+), 99 deletions(-)

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index d928339..6e46060 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -56,6 +56,7 @@
#include <rte_memory.h>
#include <rte_eal.h>
#include <rte_dev.h>
+#include <rte_kvargs.h>

#include "virtio_ethdev.h"
#include "virtio_pci.h"
@@ -174,14 +175,14 @@ virtio_send_command(struct virtqueue *vq, struct virtio_pmd_ctrl *ctrl,
* One RX packet for ACK.
*/
vq->vq_ring.desc[head].flags = VRING_DESC_F_NEXT;
- vq->vq_ring.desc[head].addr = vq->virtio_net_hdr_mz->phys_addr;
+ vq->vq_ring.desc[head].addr = vq->virtio_net_hdr_mem;
vq->vq_ring.desc[head].len = sizeof(struct virtio_net_ctrl_hdr);
vq->vq_free_cnt--;
i = vq->vq_ring.desc[head].next;

for (k = 0; k < pkt_num; k++) {
vq->vq_ring.desc[i].flags = VRING_DESC_F_NEXT;
- vq->vq_ring.desc[i].addr = vq->virtio_net_hdr_mz->phys_addr
+ vq->vq_ring.desc[i].addr = vq->virtio_net_hdr_mem
+ sizeof(struct virtio_net_ctrl_hdr)
+ sizeof(ctrl->status) + sizeof(uint8_t)*sum;
vq->vq_ring.desc[i].len = dlen[k];
@@ -191,7 +192,7 @@ virtio_send_command(struct virtqueue *vq, struct virtio_pmd_ctrl *ctrl,
}

vq->vq_ring.desc[i].flags = VRING_DESC_F_WRITE;
- vq->vq_ring.desc[i].addr = vq->virtio_net_hdr_mz->phys_addr
+ vq->vq_ring.desc[i].addr = vq->virtio_net_hdr_mem
+ sizeof(struct virtio_net_ctrl_hdr);
vq->vq_ring.desc[i].len = sizeof(ctrl->status);
vq->vq_free_cnt--;
@@ -374,68 +375,85 @@ int virtio_dev_queue_setup(struct rte_eth_dev *dev,
}
}

- /*
- * Virtio PCI device VIRTIO_PCI_QUEUE_PF register is 32bit,
- * and only accepts 32 bit page frame number.
- * Check if the allocated physical memory exceeds 16TB.
- */
- if ((mz->phys_addr + vq->vq_ring_size - 1) >> (VIRTIO_PCI_QUEUE_ADDR_SHIFT + 32)) {
- PMD_INIT_LOG(ERR, "vring address shouldn't be above 16TB!");
- rte_free(vq);
- return -ENOMEM;
- }
-
memset(mz->addr, 0, sizeof(mz->len));
vq->mz = mz;
- vq->vq_ring_mem = mz->phys_addr;
vq->vq_ring_virt_mem = mz->addr;
- PMD_INIT_LOG(DEBUG, "vq->vq_ring_mem: 0x%"PRIx64, (uint64_t)mz->phys_addr);
- PMD_INIT_LOG(DEBUG, "vq->vq_ring_virt_mem: 0x%"PRIx64, (uint64_t)(uintptr_t)mz->addr);
+
+ if (dev->dev_type == RTE_ETH_DEV_PCI) {
+ vq->vq_ring_mem = mz->phys_addr;
+
+ /* Virtio PCI device VIRTIO_PCI_QUEUE_PF register is 32bit,
+ * and only accepts 32 bit page frame number.
+ * Check if the allocated physical memory exceeds 16TB.
+ */
+ uint64_t last_physaddr = vq->vq_ring_mem + vq->vq_ring_size - 1;
+ if (last_physaddr >> (VIRTIO_PCI_QUEUE_ADDR_SHIFT + 32)) {
+ PMD_INIT_LOG(ERR, "vring address shouldn't be above 16TB!");
+ rte_free(vq);
+ return -ENOMEM;
+ }
+ }
+#ifdef RTE_VIRTIO_VDEV
+ else
+ vq->vq_ring_mem = (phys_addr_t)mz->addr; /* Use vaddr!!! */
+#endif
+
+ PMD_INIT_LOG(DEBUG, "vq->vq_ring_mem: 0x%"PRIx64,
+ (uint64_t)vq->vq_ring_mem);
+ PMD_INIT_LOG(DEBUG, "vq->vq_ring_virt_mem: 0x%"PRIx64,
+ (uint64_t)(uintptr_t)vq->vq_ring_virt_mem);
vq->virtio_net_hdr_mz = NULL;
vq->virtio_net_hdr_mem = 0;

+ uint64_t hdr_size = 0;
if (queue_type == VTNET_TQ) {
/*
* For each xmit packet, allocate a virtio_net_hdr
*/
snprintf(vq_name, sizeof(vq_name), "port%d_tvq%d_hdrzone",
dev->data->port_id, queue_idx);
- vq->virtio_net_hdr_mz = rte_memzone_reserve_aligned(vq_name,
- vq_size * hw->vtnet_hdr_size,
- socket_id, 0, RTE_CACHE_LINE_SIZE);
- if (vq->virtio_net_hdr_mz == NULL) {
- if (rte_errno == EEXIST)
- vq->virtio_net_hdr_mz =
- rte_memzone_lookup(vq_name);
- if (vq->virtio_net_hdr_mz == NULL) {
- rte_free(vq);
- return -ENOMEM;
- }
- }
- vq->virtio_net_hdr_mem =
- vq->virtio_net_hdr_mz->phys_addr;
- memset(vq->virtio_net_hdr_mz->addr, 0,
- vq_size * hw->vtnet_hdr_size);
+ hdr_size = vq_size * hw->vtnet_hdr_size;
} else if (queue_type == VTNET_CQ) {
/* Allocate a page for control vq command, data and status */
snprintf(vq_name, sizeof(vq_name), "port%d_cvq_hdrzone",
dev->data->port_id);
- vq->virtio_net_hdr_mz = rte_memzone_reserve_aligned(vq_name,
- PAGE_SIZE, socket_id, 0, RTE_CACHE_LINE_SIZE);
- if (vq->virtio_net_hdr_mz == NULL) {
+ hdr_size = PAGE_SIZE;
+ }
+
+ if (hdr_size) { /* queue_type is VTNET_TQ or VTNET_CQ */
+ mz = rte_memzone_reserve_aligned(vq_name,
+ hdr_size, socket_id, 0, RTE_CACHE_LINE_SIZE);
+ if (mz == NULL) {
if (rte_errno == EEXIST)
- vq->virtio_net_hdr_mz =
- rte_memzone_lookup(vq_name);
- if (vq->virtio_net_hdr_mz == NULL) {
+ mz = rte_memzone_lookup(vq_name);
+ if (mz == NULL) {
rte_free(vq);
return -ENOMEM;
}
}
- vq->virtio_net_hdr_mem =
- vq->virtio_net_hdr_mz->phys_addr;
- memset(vq->virtio_net_hdr_mz->addr, 0, PAGE_SIZE);
+ vq->virtio_net_hdr_mz = mz;
+ vq->virtio_net_hdr_vaddr = mz->addr;
+ memset(vq->virtio_net_hdr_vaddr, 0, hdr_size);
+
+ if (dev->dev_type == RTE_ETH_DEV_PCI)
+ vq->virtio_net_hdr_mem = mz->phys_addr;
+#ifdef RTE_VIRTIO_VDEV
+ else
+ vq->virtio_net_hdr_mem = (phys_addr_t)mz->addr; /* Use vaddr!!! */
+#endif
}

+ struct rte_mbuf *m = NULL;
+ if (dev->dev_type == RTE_ETH_DEV_PCI)
+ vq->offset = (uintptr_t)&m->buf_addr;
+#ifdef RTE_VIRTIO_VDEV
+ else {
+ vq->offset = (uintptr_t)&m->buf_physaddr;
+#if (RTE_BYTE_ORDER == RTE_BIG_ENDIAN) && (__WORDSIZE == 32)
+ vq->offset += 4;
+#endif
+ }
+#endif
/*
* Set guest physical address of the virtqueue
* in VIRTIO_PCI_QUEUE_PFN config register of device
@@ -491,8 +509,10 @@ virtio_dev_close(struct rte_eth_dev *dev)
PMD_INIT_LOG(DEBUG, "virtio_dev_close");

/* reset the NIC */
- if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
- vtpci_irq_config(hw, VIRTIO_MSI_NO_VECTOR);
+ if (dev->dev_type == RTE_ETH_DEV_PCI) {
+ if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
+ vtpci_irq_config(hw, VIRTIO_MSI_NO_VECTOR);
+ }
vtpci_reset(hw);
hw->started = 0;
virtio_dev_free_mbufs(dev);
@@ -1233,8 +1253,9 @@ virtio_interrupt_handler(__rte_unused struct rte_intr_handle *handle,
isr = vtpci_isr(hw);
PMD_DRV_LOG(INFO, "interrupt status = %#x", isr);

- if (rte_intr_enable(&dev->pci_dev->intr_handle) < 0)
- PMD_DRV_LOG(ERR, "interrupt enable failed");
+ if (dev->dev_type == RTE_ETH_DEV_PCI)
+ if (rte_intr_enable(&dev->pci_dev->intr_handle) < 0)
+ PMD_DRV_LOG(ERR, "interrupt enable failed");

if (isr & VIRTIO_PCI_ISR_CONFIG) {
if (virtio_dev_link_update(dev, 0) == 0)
@@ -1287,11 +1308,18 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)

pci_dev = eth_dev->pci_dev;

- if (virtio_resource_init(pci_dev) < 0)
- return -1;
-
- hw->use_msix = virtio_has_msix(&pci_dev->addr);
- hw->io_base = (uint32_t)(uintptr_t)pci_dev->mem_resource[0].addr;
+ if (eth_dev->dev_type == RTE_ETH_DEV_PCI) {
+ if (virtio_resource_init(pci_dev) < 0)
+ return -1;
+ hw->use_msix = virtio_has_msix(&pci_dev->addr);
+ hw->io_base = (uint32_t)(uintptr_t)pci_dev->mem_resource[0].addr;
+ }
+#ifdef RTE_VIRTIO_VDEV
+ else {
+ hw->use_msix = 0;
+ hw->io_base = VIRTIO_VDEV_IO_BASE;
+ }
+#endif

/* Reset the device although not necessary at startup */
vtpci_reset(hw);
@@ -1304,10 +1332,12 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
virtio_negotiate_features(hw);

/* If host does not support status then disable LSC */
- if (!vtpci_with_feature(hw, VIRTIO_NET_F_STATUS))
- pci_dev->driver->drv_flags &= ~RTE_PCI_DRV_INTR_LSC;
+ if (eth_dev->dev_type == RTE_ETH_DEV_PCI) {
+ if (!vtpci_with_feature(hw, VIRTIO_NET_F_STATUS))
+ pci_dev->driver->drv_flags &= ~RTE_PCI_DRV_INTR_LSC;

- rte_eth_copy_pci_info(eth_dev, pci_dev);
+ rte_eth_copy_pci_info(eth_dev, pci_dev);
+ }

rx_func_get(eth_dev);

@@ -1383,15 +1413,16 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)

PMD_INIT_LOG(DEBUG, "hw->max_rx_queues=%d hw->max_tx_queues=%d",
hw->max_rx_queues, hw->max_tx_queues);
- PMD_INIT_LOG(DEBUG, "port %d vendorID=0x%x deviceID=0x%x",
- eth_dev->data->port_id, pci_dev->id.vendor_id,
- pci_dev->id.device_id);
-
- /* Setup interrupt callback */
- if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
- rte_intr_callback_register(&pci_dev->intr_handle,
- virtio_interrupt_handler, eth_dev);
-
+ if (eth_dev->dev_type == RTE_ETH_DEV_PCI) {
+ PMD_INIT_LOG(DEBUG, "port %d vendorID=0x%x deviceID=0x%x",
+ eth_dev->data->port_id, pci_dev->id.vendor_id,
+ pci_dev->id.device_id);
+
+ /* Setup interrupt callback */
+ if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
+ rte_intr_callback_register(&pci_dev->intr_handle,
+ virtio_interrupt_handler, eth_dev);
+ }
virtio_dev_cq_start(eth_dev);

return 0;
@@ -1424,10 +1455,12 @@ eth_virtio_dev_uninit(struct rte_eth_dev *eth_dev)
eth_dev->data->mac_addrs = NULL;

/* reset interrupt callback */
- if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
- rte_intr_callback_unregister(&pci_dev->intr_handle,
- virtio_interrupt_handler,
- eth_dev);
+ if (eth_dev->dev_type == RTE_ETH_DEV_PCI) {
+ if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
+ rte_intr_callback_unregister(&pci_dev->intr_handle,
+ virtio_interrupt_handler,
+ eth_dev);
+ }

PMD_INIT_LOG(DEBUG, "dev_uninit completed");

@@ -1491,11 +1524,13 @@ virtio_dev_configure(struct rte_eth_dev *dev)
return -ENOTSUP;
}

- if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
- if (vtpci_irq_config(hw, 0) == VIRTIO_MSI_NO_VECTOR) {
- PMD_DRV_LOG(ERR, "failed to set config vector");
- return -EBUSY;
- }
+ if (dev->dev_type == RTE_ETH_DEV_PCI) {
+ if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
+ if (vtpci_irq_config(hw, 0) == VIRTIO_MSI_NO_VECTOR) {
+ PMD_DRV_LOG(ERR, "failed to set config vector");
+ return -EBUSY;
+ }
+ }

return 0;
}
@@ -1689,3 +1724,162 @@ static struct rte_driver rte_virtio_driver = {
};

PMD_REGISTER_DRIVER(rte_virtio_driver);
+
+#ifdef RTE_VIRTIO_VDEV
+
+static const char *valid_args[] = {
+#define ETH_CVIO_ARG_RX_NUM "rx"
+ ETH_CVIO_ARG_RX_NUM,
+#define ETH_CVIO_ARG_TX_NUM "tx"
+ ETH_CVIO_ARG_TX_NUM,
+#define ETH_CVIO_ARG_CQ_NUM "cq"
+ ETH_CVIO_ARG_CQ_NUM,
+#define ETH_CVIO_ARG_MAC "mac"
+ ETH_CVIO_ARG_MAC,
+#define ETH_CVIO_ARG_PATH "path"
+ ETH_CVIO_ARG_PATH,
+#define ETH_CVIO_ARG_QUEUE_SIZE "queue_num"
+ ETH_CVIO_ARG_QUEUE_SIZE,
+ NULL
+};
+
+static int
+get_string_arg(const char *key __rte_unused,
+ const char *value, void *extra_args)
+{
+ if ((value == NULL) || (extra_args == NULL))
+ return -EINVAL;
+
+ strcpy(extra_args, value);
+
+ return 0;
+}
+
+static int
+get_integer_arg(const char *key __rte_unused,
+ const char *value, void *extra_args)
+{
+ uint64_t *p_u64 = extra_args;
+
+ if ((value == NULL) || (extra_args == NULL))
+ return -EINVAL;
+
+ *p_u64 = (uint64_t)strtoull(value, NULL, 0);
+
+ return 0;
+}
+
+static struct rte_eth_dev *
+cvio_eth_dev_alloc(const char *name)
+{
+ struct rte_eth_dev *eth_dev;
+ struct rte_eth_dev_data *data;
+ struct virtio_hw *hw;
+
+ eth_dev = rte_eth_dev_allocate(name, RTE_ETH_DEV_VIRTUAL);
+ if (eth_dev == NULL)
+ rte_panic("cannot alloc rte_eth_dev\n");
+
+ data = eth_dev->data;
+
+ hw = rte_zmalloc(NULL, sizeof(*hw), 0);
+ if (!hw)
+ rte_panic("malloc virtio_hw failed\n");
+
+ data->dev_private = hw;
+ data->numa_node = SOCKET_ID_ANY;
+ eth_dev->pci_dev = NULL;
+ /* will be used in virtio_dev_info_get() */
+ eth_dev->driver = &rte_virtio_pmd;
+ /* TODO: eth_dev->link_intr_cbs */
+ return eth_dev;
+}
+
+#define CVIO_DEF_CQ_EN 0
+#define CVIO_DEF_Q_NUM 1
+#define CVIO_DEF_Q_SZ 256
+/*
+ * Dev initialization routine. Invoked once for each virtio vdev at
+ * EAL init time, see rte_eal_dev_init().
+ * Returns 0 on success.
+ */
+static int
+rte_cvio_pmd_devinit(const char *name, const char *params)
+{
+ struct rte_kvargs *kvlist = NULL;
+ struct rte_eth_dev *eth_dev = NULL;
+ uint64_t nb_rx = CVIO_DEF_Q_NUM;
+ uint64_t nb_tx = CVIO_DEF_Q_NUM;
+ uint64_t nb_cq = CVIO_DEF_CQ_EN;
+ uint64_t queue_num = CVIO_DEF_Q_SZ;
+ char sock_path[256];
+ char mac_addr[32];
+ int flag_mac = 0;
+
+ if (params == NULL || params[0] == '\0')
+ rte_panic("arg %s is mandatory for eth_cvio\n",
+ ETH_CVIO_ARG_PATH);
+
+ kvlist = rte_kvargs_parse(params, valid_args);
+ if (!kvlist)
+ rte_panic("error when parsing param\n");
+
+ if (rte_kvargs_count(kvlist, ETH_CVIO_ARG_PATH) == 1)
+ rte_kvargs_process(kvlist, ETH_CVIO_ARG_PATH,
+ &get_string_arg, sock_path);
+ else
+ rte_panic("arg %s is mandatory for eth_cvio\n",
+ ETH_CVIO_ARG_PATH);
+
+ if (rte_kvargs_count(kvlist, ETH_CVIO_ARG_MAC) == 1) {
+ rte_kvargs_process(kvlist, ETH_CVIO_ARG_MAC,
+ &get_string_arg, mac_addr);
+ flag_mac = 1;
+ }
+
+ if (rte_kvargs_count(kvlist, ETH_CVIO_ARG_QUEUE_SIZE) == 1)
+ rte_kvargs_process(kvlist, ETH_CVIO_ARG_QUEUE_SIZE,
+ &get_integer_arg, &queue_num);
+
+ if (rte_kvargs_count(kvlist, ETH_CVIO_ARG_RX_NUM) == 1)
+ rte_kvargs_process(kvlist, ETH_CVIO_ARG_RX_NUM,
+ &get_integer_arg, &nb_rx);
+
+ if (rte_kvargs_count(kvlist, ETH_CVIO_ARG_TX_NUM) == 1)
+ rte_kvargs_process(kvlist, ETH_CVIO_ARG_TX_NUM,
+ &get_integer_arg, &nb_tx);
+
+ if (rte_kvargs_count(kvlist, ETH_CVIO_ARG_CQ_NUM) == 1)
+ rte_kvargs_process(kvlist, ETH_CVIO_ARG_CQ_NUM,
+ &get_integer_arg, &nb_cq);
+
+ eth_dev = cvio_eth_dev_alloc(name);
+
+ virtio_vdev_init(eth_dev->data, sock_path,
+ nb_rx, nb_tx, nb_cq, queue_num,
+ (flag_mac) ? mac_addr : NULL);
+
+ /* originally, this will be called in rte_eal_pci_probe() */
+ eth_virtio_dev_init(eth_dev);
+
+ return 0;
+}
+
+static int
+rte_cvio_pmd_devuninit(const char *name)
+{
+ /* TODO: if this is the last vdev, un-init and free its memory */
+ rte_panic("%s: %s", __func__, name);
+ return 0;
+}
+
+static struct rte_driver rte_cvio_driver = {
+ .name = "eth_cvio",
+ .type = PMD_VDEV,
+ .init = rte_cvio_pmd_devinit,
+ .uninit = rte_cvio_pmd_devuninit,
+};
+
+PMD_REGISTER_DRIVER(rte_cvio_driver);
+
+#endif
diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index 9e1ecb3..90890b4 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -126,4 +126,5 @@ uint16_t virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
void virtio_vdev_init(struct rte_eth_dev_data *data, const char *path,
int nb_rx, int nb_tx, int nb_cq, int queue_num, char *mac);
#endif
+
#endif /* _VIRTIO_ETHDEV_H_ */
diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
index af05ae2..d79bd05 100644
--- a/drivers/net/virtio/virtio_pci.h
+++ b/drivers/net/virtio/virtio_pci.h
@@ -249,31 +249,31 @@ uint32_t virtio_ioport_read(struct virtio_hw *, uint64_t);
void virtio_ioport_write(struct virtio_hw *, uint64_t, uint32_t);

#define VIRTIO_READ_REG_1(hw, reg) \
- (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ ((hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
inb((VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
- :virtio_ioport_read(hw, reg)
+ :virtio_ioport_read(hw, reg))
#define VIRTIO_WRITE_REG_1(hw, reg, value) \
- (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ ((hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
outb_p((unsigned char)(value), (VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
- :virtio_ioport_write(hw, reg, value)
+ :virtio_ioport_write(hw, reg, value))

#define VIRTIO_READ_REG_2(hw, reg) \
- (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ ((hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
inw((VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
- :virtio_ioport_read(hw, reg)
+ :virtio_ioport_read(hw, reg))
#define VIRTIO_WRITE_REG_2(hw, reg, value) \
- (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ ((hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
outw_p((unsigned short)(value), (VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
- :virtio_ioport_write(hw, reg, value)
+ :virtio_ioport_write(hw, reg, value))

#define VIRTIO_READ_REG_4(hw, reg) \
- (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ ((hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
inl((VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
- :virtio_ioport_read(hw, reg)
+ :virtio_ioport_read(hw, reg))
#define VIRTIO_WRITE_REG_4(hw, reg, value) \
- (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ ((hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
outl_p((unsigned int)(value), (VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
- :virtio_ioport_write(hw, reg, value)
+ :virtio_ioport_write(hw, reg, value))

#else /* RTE_VIRTIO_VDEV */

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 74b39ef..dd07ba7 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -191,8 +191,7 @@ virtqueue_enqueue_recv_refill(struct virtqueue *vq, struct rte_mbuf *cookie)

start_dp = vq->vq_ring.desc;
start_dp[idx].addr =
- (uint64_t)(cookie->buf_physaddr + RTE_PKTMBUF_HEADROOM
- - hw->vtnet_hdr_size);
+ RTE_MBUF_DATA_DMA_ADDR(cookie, vq->offset) - hw->vtnet_hdr_size;
start_dp[idx].len =
cookie->buf_len - RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size;
start_dp[idx].flags = VRING_DESC_F_WRITE;
@@ -237,7 +236,7 @@ virtqueue_enqueue_xmit(struct virtqueue *txvq, struct rte_mbuf *cookie)

for (; ((seg_num > 0) && (cookie != NULL)); seg_num--) {
idx = start_dp[idx].next;
- start_dp[idx].addr = RTE_MBUF_DATA_DMA_ADDR(cookie);
+ start_dp[idx].addr = RTE_MBUF_DATA_DMA_ADDR(cookie, txvq->offset);
start_dp[idx].len = cookie->data_len;
start_dp[idx].flags = VRING_DESC_F_NEXT;
cookie = cookie->next;
@@ -343,7 +342,7 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
VIRTIO_WRITE_REG_2(vq->hw, VIRTIO_PCI_QUEUE_SEL,
vq->vq_queue_index);
VIRTIO_WRITE_REG_4(vq->hw, VIRTIO_PCI_QUEUE_PFN,
- vq->mz->phys_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT);
+ vq->vq_ring_mem >> VIRTIO_PCI_QUEUE_ADDR_SHIFT);
} else if (queue_type == VTNET_TQ) {
if (use_simple_rxtx) {
int mid_idx = vq->vq_nentries >> 1;
@@ -366,12 +365,12 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
VIRTIO_WRITE_REG_2(vq->hw, VIRTIO_PCI_QUEUE_SEL,
vq->vq_queue_index);
VIRTIO_WRITE_REG_4(vq->hw, VIRTIO_PCI_QUEUE_PFN,
- vq->mz->phys_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT);
+ vq->vq_ring_mem >> VIRTIO_PCI_QUEUE_ADDR_SHIFT);
} else {
VIRTIO_WRITE_REG_2(vq->hw, VIRTIO_PCI_QUEUE_SEL,
vq->vq_queue_index);
VIRTIO_WRITE_REG_4(vq->hw, VIRTIO_PCI_QUEUE_PFN,
- vq->mz->phys_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT);
+ vq->vq_ring_mem >> VIRTIO_PCI_QUEUE_ADDR_SHIFT);
}
}

diff --git a/drivers/net/virtio/virtio_rxtx_simple.c b/drivers/net/virtio/virtio_rxtx_simple.c
index ff3c11a..3a14a4e 100644
--- a/drivers/net/virtio/virtio_rxtx_simple.c
+++ b/drivers/net/virtio/virtio_rxtx_simple.c
@@ -80,8 +80,8 @@ virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
vq->sw_ring[desc_idx] = cookie;

start_dp = vq->vq_ring.desc;
- start_dp[desc_idx].addr = (uint64_t)((uintptr_t)cookie->buf_physaddr +
- RTE_PKTMBUF_HEADROOM - sizeof(struct virtio_net_hdr));
+ start_dp[desc_idx].addr = RTE_MBUF_DATA_DMA_ADDR(cookie, vq->offset)
+ - sizeof(struct virtio_net_hdr);
start_dp[desc_idx].len = cookie->buf_len -
RTE_PKTMBUF_HEADROOM + sizeof(struct virtio_net_hdr);

@@ -118,9 +118,8 @@ virtio_rxq_rearm_vec(struct virtqueue *rxvq)
p = (uintptr_t)&sw_ring[i]->rearm_data;
*(uint64_t *)p = rxvq->mbuf_initializer;

- start_dp[i].addr =
- (uint64_t)((uintptr_t)sw_ring[i]->buf_physaddr +
- RTE_PKTMBUF_HEADROOM - sizeof(struct virtio_net_hdr));
+ start_dp[i].addr = RTE_MBUF_DATA_DMA_ADDR(sw_ring[i], rxvq->offset)
+ - sizeof(struct virtio_net_hdr);
start_dp[i].len = sw_ring[i]->buf_len -
RTE_PKTMBUF_HEADROOM + sizeof(struct virtio_net_hdr);
}
@@ -366,7 +365,7 @@ virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
txvq->vq_descx[desc_idx + i].cookie = tx_pkts[i];
for (i = 0; i < nb_tail; i++) {
start_dp[desc_idx].addr =
- RTE_MBUF_DATA_DMA_ADDR(*tx_pkts);
+ RTE_MBUF_DATA_DMA_ADDR(*tx_pkts, txvq->offset);
start_dp[desc_idx].len = (*tx_pkts)->pkt_len;
tx_pkts++;
desc_idx++;
@@ -377,7 +376,8 @@ virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
for (i = 0; i < nb_commit; i++)
txvq->vq_descx[desc_idx + i].cookie = tx_pkts[i];
for (i = 0; i < nb_commit; i++) {
- start_dp[desc_idx].addr = RTE_MBUF_DATA_DMA_ADDR(*tx_pkts);
+ start_dp[desc_idx].addr = RTE_MBUF_DATA_DMA_ADDR(*tx_pkts,
+ txvq->offset);
start_dp[desc_idx].len = (*tx_pkts)->pkt_len;
tx_pkts++;
desc_idx++;
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 61b3137..dc0b656 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -66,8 +66,14 @@ struct rte_mbuf;

#define VIRTQUEUE_MAX_NAME_SZ 32

-#define RTE_MBUF_DATA_DMA_ADDR(mb) \
+#ifdef RTE_VIRTIO_VDEV
+#define RTE_MBUF_DATA_DMA_ADDR(mb, offset) \
+ (uint64_t)((uintptr_t)(*(void **)((uintptr_t)mb + offset)) \
+ + (mb)->data_off)
+#else
+#define RTE_MBUF_DATA_DMA_ADDR(mb, offset) \
(uint64_t) ((mb)->buf_physaddr + (mb)->data_off)
+#endif /* RTE_VIRTIO_VDEV */

#define VTNET_SQ_RQ_QUEUE_IDX 0
#define VTNET_SQ_TQ_QUEUE_IDX 1
@@ -167,7 +173,8 @@ struct virtqueue {

void *vq_ring_virt_mem; /**< linear address of vring*/
unsigned int vq_ring_size;
- phys_addr_t vq_ring_mem; /**< physical address of vring */
+ phys_addr_t vq_ring_mem; /**< phys address of vring for pci dev,
+ virt addr of vring for vdev */

struct vring vq_ring; /**< vring keeping desc, used and avail */
uint16_t vq_free_cnt; /**< num of desc available */
@@ -186,8 +193,10 @@ struct virtqueue {
*/
uint16_t vq_used_cons_idx;
uint16_t vq_avail_idx;
+ uint16_t offset; /**< relative offset to obtain addr in mbuf */
uint64_t mbuf_initializer; /**< value to init mbufs. */
phys_addr_t virtio_net_hdr_mem; /**< hdr for each xmit packet */
+ void *virtio_net_hdr_vaddr; /**< linear address of virtio net hdr zone */

struct rte_mbuf **sw_ring; /**< RX software ring. */
/* dummy mbuf, for wraparound when processing RX ring. */
--
2.1.4
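As a quick illustration of the devargs flow in rte_cvio_pmd_devinit() above, a minimal standalone sketch (not part of the patch; print_kv is a hypothetical helper) built on the same rte_kvargs calls:

#include <stdio.h>
#include <rte_kvargs.h>

/* hypothetical handler: print each key=value pair */
static int
print_kv(const char *key, const char *value, void *opaque)
{
	(void)opaque;
	printf("%s = %s\n", key, value);
	return 0;
}

int
main(void)
{
	static const char *keys[] = { "rx", "tx", "cq", "mac", "path",
				      "queue_num", NULL };
	/* what EAL hands to rte_cvio_pmd_devinit() for
	 * --vdev=eth_cvio0,queue_num=256,rx=1,tx=1,cq=0,path=/var/run/usvhost */
	const char *params = "queue_num=256,rx=1,tx=1,cq=0,path=/var/run/usvhost";
	struct rte_kvargs *kvlist = rte_kvargs_parse(params, keys);

	if (kvlist == NULL)
		return 1;
	/* a NULL key lets the handler see every pair */
	rte_kvargs_process(kvlist, NULL, print_kv, NULL);
	rte_kvargs_free(kvlist);
	return 0;
}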
Pavel Fedin
2016-01-12 07:45:59 UTC
Permalink
Hello!

See inline
-----Original Message-----
Sent: Sunday, January 10, 2016 2:43 PM
Subject: [PATCH 4/4] virtio/vdev: add a new vdev named eth_cvio
[...]
+ struct rte_mbuf *m = NULL;
+ if (dev->dev_type == RTE_ETH_DEV_PCI)
+ vq->offset = (uintptr_t)&m->buf_addr;
+#ifdef RTE_VIRTIO_VDEV
+ else {
+ vq->offset = (uintptr_t)&m->buf_physaddr;
Not sure, but shouldn't these be swapped? Originally, for PCI devices, we used buf_physaddr.
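To make the point concrete, a small self-contained sketch of the offset trick in question (an illustration only; the struct below is a simplified stand-in for struct rte_mbuf, and a 64-bit layout is assumed):

#include <stddef.h>
#include <stdint.h>

struct mbuf_like {		/* simplified stand-in for struct rte_mbuf */
	void *buf_addr;		/* virtual address of the data buffer */
	uint64_t buf_physaddr;	/* physical address of the data buffer */
	uint16_t data_off;
};

/* Same pattern as RTE_MBUF_DATA_DMA_ADDR(mb, offset): read a pointer-sized
 * field at a runtime-chosen byte offset inside the mbuf. */
static inline uint64_t
dma_addr(const struct mbuf_like *mb, uint16_t offset)
{
	return (uint64_t)(uintptr_t)(*(void **)((uintptr_t)mb + offset))
		+ mb->data_off;
}

/* A PCI virtio device must hand the backend physical addresses, while the
 * container vdev hands vhost virtual addresses, so one would expect:
 *	pci:  vq->offset = offsetof(struct mbuf_like, buf_physaddr);
 *	vdev: vq->offset = offsetof(struct mbuf_like, buf_addr);
 * i.e. the opposite of the hunk above, hence the question. */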
[...]
#define VIRTIO_READ_REG_1(hw, reg) \
- (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ ((hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
inb((VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
- :virtio_ioport_read(hw, reg)
+ :virtio_ioport_read(hw, reg))
#define VIRTIO_WRITE_REG_1(hw, reg, value) \
- (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ ((hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
outb_p((unsigned char)(value), (VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
- :virtio_ioport_write(hw, reg, value)
+ :virtio_ioport_write(hw, reg, value))
[...]
These bracket fixups should be squashed into #3
Kind regards,
Pavel Fedin
Senior Engineer
Samsung Electronics Research center Russia
Yuanhan Liu
2016-01-12 07:59:24 UTC
Permalink
Post by Pavel Fedin
Hello!
See inline
Hi,

Please strip unrelated context so that people can reach your comments as
quickly as possible; otherwise, they can easily get lost in the long patch.
Post by Pavel Fedin
-----Original Message-----
+ struct rte_mbuf *m = NULL;
+ if (dev->dev_type == RTE_ETH_DEV_PCI)
+ vq->offset = (uintptr_t)&m->buf_addr;
+#ifdef RTE_VIRTIO_VDEV
+ else {
+ vq->offset = (uintptr_t)&m->buf_physaddr;
Not sure, but shouldn't these be swapped? Originally, for PCI devices, we used buf_physaddr.
And this reply just serves as an example :)

--yliu
Tan, Jianfeng
2016-01-12 08:39:34 UTC
Permalink
Hi Fedin,
Post by Pavel Fedin
Hello!
See inline
Post by Jianfeng Tan
...
}
+ struct rte_mbuf *m = NULL;
+ if (dev->dev_type == RTE_ETH_DEV_PCI)
+ vq->offset = (uintptr_t)&m->buf_addr;
+#ifdef RTE_VIRTIO_VDEV
+ else {
+ vq->offset = (uintptr_t)&m->buf_physaddr;
Not sure, but shouldn't these be swapped? Originally, for PCI devices, we used buf_physaddr.
Oops, seems that you are right. I'm trying to figure out why I can still
rx/tx packets using the wrong version.
Post by Pavel Fedin
Post by Jianfeng Tan
#define VIRTIO_READ_REG_1(hw, reg) \
- (hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
+ ((hw->io_base != VIRTIO_VDEV_IO_BASE) ? \
inb((VIRTIO_PCI_REG_ADDR((hw), (reg)))) \
- :virtio_ioport_read(hw, reg)
+ :virtio_ioport_read(hw, reg))
[...]
These bracket fixups should be squashed into #3
I'll rewrite this into function pointers according to Yuanhan's patch
for virtio 1.0.

Thanks,
Jianfeng
Tan, Jianfeng
2016-01-12 09:15:32 UTC
Permalink
Hi Fedin,
Post by Tan, Jianfeng
Hi Fedin,
Post by Pavel Fedin
Hello!
See inline
...
}
+ struct rte_mbuf *m = NULL;
+ if (dev->dev_type == RTE_ETH_DEV_PCI)
+ vq->offset = (uintptr_t)&m->buf_addr;
+#ifdef RTE_VIRTIO_VDEV
+ else {
+ vq->offset = (uintptr_t)&m->buf_physaddr;
Not sure, but shouldn't these be swapped? Originally, for PCI
devices, we used buf_physaddr.
Oops, seems that you are right. I'm trying to figure out why I can still
rx/tx packets using the wrong version.
I figured out why. When we run apps without root privilege, the mempool's
elt_pa is assigned the same value as elt_va_start, so buf_physaddr actually
holds the virtual address and happens to be the right value for the vdev's
address translation. But it's definitely a bug. Thanks for pointing this out.

Thanks,
Jianfeng
Qiu, Michael
2016-01-27 03:10:54 UTC
Permalink
Post by Jianfeng Tan
Add a new virtual device named eth_cvio, it can be used just like
eth_ring, eth_null, etc.
- rx (optional, 1 by default): number of rx, only allowed to be
1 for now.
- tx (optional, 1 by default): number of tx, only allowed to be
1 for now.
From the app's side, virtio is presented as hardware; in your implementation
rx/tx are the maximum numbers of queues the virtio device supports. Does that
make sense?

Why should the user have to tell the "hardware" how many queues it supports?
We'd better make it non-configurable: only let users query it like real
hardware, and then decide how many queues to enable.
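Concretely, the pattern being suggested, as a sketch using the standard ethdev API (wanted_* and conf stand in for application choices):

#include <rte_common.h>
#include <rte_ethdev.h>

/* query the vdev like real hardware, then decide how many queues to enable */
static int
configure_queues(uint8_t port_id, uint16_t wanted_rxq, uint16_t wanted_txq,
		 const struct rte_eth_conf *conf)
{
	struct rte_eth_dev_info info;

	rte_eth_dev_info_get(port_id, &info);
	return rte_eth_dev_configure(port_id,
				     RTE_MIN(wanted_rxq, info.max_rx_queues),
				     RTE_MIN(wanted_txq, info.max_tx_queues),
				     conf);
}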
Post by Jianfeng Tan
- cq (optional, 0 by default): if ctrl queue is enabled, not
supported for now.
- mac (optional): mac address, random value will be given if not
specified.
- queue_num (optional, 256 by default): size of virtqueue.
Better change it to queue_size.

Thanks,
Michael
Pavel Fedin
2016-01-11 14:21:27 UTC
Permalink
Hello!
Post by Jianfeng Tan
This patchset is to provide high performance networking interface (virtio)
for container-based DPDK applications. The way of starting DPDK apps in
containers with ownership of NIC devices exclusively is beyond the scope.
The basic idea here is to present a new virtual device (named eth_cvio),
which can be discovered and initialized in container-based DPDK apps using
rte_eal_init(). To minimize the change, we reuse already-existing virtio
frontend driver code (driver/net/virtio/).
With the aforementioned fixes I tried to run it inside libvirt-lxc. I got the following:
a) With hugepages - "abort with 256 hugepage files exceed the maximum of 8 for vhost-user" - I set -m 512
b) With --single-file - ovs runs, but doesn't get any packets at all. When I try to ping the container from the host side, it
counts drops on the vhost-user port.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Tan, Jianfeng
2016-01-11 15:53:08 UTC
Permalink
Hi Fedin,
Post by Pavel Fedin
a) With hugepages - "abort with 256 hugepage files exceed the maximum of 8 for vhost-user" - i set -m 512
This is currently a known issue; we have discussed it in another thread
with Tetsuya.
Post by Pavel Fedin
b) With --single-file - ovs runs, but doesn't get any packets at all. When I try to ping the container from the host side, it
counts drops on the vhost-user port.
Can you check the OVS on the host side, whether it prints out the message
"virtio is now ready for processing"?

Thanks,
Jianfeng
Post by Pavel Fedin
Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Pavel Fedin
2016-01-12 07:38:26 UTC
Permalink
Hello!
Post by Tan, Jianfeng
Post by Pavel Fedin
b) With --single-file - ovs runs, but doesn't get any packets at all. When I try to ping
the container from the host side, it counts drops on the vhost-user port.
Can you check the OVS on the host side, whether it prints out the message "virtio
is now ready for processing"?
No, I get errors:
--- cut ---
Jan 12 10:27:43 nfv_test_x86_64 ovs-vswitchd[18858]: VHOST_CONFIG: new virtio connection is 38
Jan 12 10:27:43 nfv_test_x86_64 ovs-vswitchd[18858]: VHOST_CONFIG: new device, handle is 0
Jan 12 10:27:43 nfv_test_x86_64 ovs-vswitchd[18858]: VHOST_CONFIG: read message VHOST_USER_SET_OWNER
Jan 12 10:27:43 nfv_test_x86_64 ovs-vswitchd[18858]: VHOST_CONFIG: read message VHOST_USER_GET_FEATURES
Jan 12 10:27:43 nfv_test_x86_64 ovs-vswitchd[18858]: VHOST_CONFIG: read message VHOST_USER_SET_FEATURES
Jan 12 10:27:43 nfv_test_x86_64 kernel: device ovs-netdev entered promiscuous mode
Jan 12 10:27:43 nfv_test_x86_64 kernel: device ovs0 entered promiscuous mode
Jan 12 10:27:43 nfv_test_x86_64 ovs-vswitchd[18858]: VHOST_CONFIG: read message VHOST_USER_SET_MEM_TABLE
Jan 12 10:27:43 nfv_test_x86_64 ovs-vswitchd[18858]: VHOST_CONFIG: mapped region 0 fd:39 to:0x7f079c600000 sz:0x200000 off:0x0
align:0x200000
Jan 12 10:27:43 nfv_test_x86_64 ovs-vswitchd[18858]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
Jan 12 10:27:43 nfv_test_x86_64 ovs-vswitchd[18858]: VHOST_CONFIG: vring call idx:0 file:49
Jan 12 10:27:43 nfv_test_x86_64 ovs-vswitchd[18858]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM
Jan 12 10:27:43 nfv_test_x86_64 ovs-vswitchd[18858]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
Jan 12 10:27:43 nfv_test_x86_64 ovs-vswitchd[18858]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_ADDR
Jan 12 10:27:43 nfv_test_x86_64 ovs-vswitchd[18858]: VHOST_CONFIG: (0) Failed to find desc ring address.
--- cut ---

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Rich Lane
2016-01-12 08:14:09 UTC
Permalink
See my reply to "mem: add API to obstain memory-backed file info" for a
workaround. With fixes for that and the TUNSETVNETHDRSZ issue I was able to
get traffic running over vhost-user.
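For context on the TUNSETVNETHDRSZ issue mentioned here, a small sketch of the knob involved (the helper name is made up for illustration): the vhost-net/tap fd's virtio-net header size must match what the frontend expects, 10 bytes plain or 12 with mergeable rx buffers.

#include <linux/if_tun.h>
#include <sys/ioctl.h>

/* hypothetical helper: align the tap/vhost-net vnet header size with the
 * frontend (sizeof(struct virtio_net_hdr) == 10, or 12 when
 * VIRTIO_NET_F_MRG_RXBUF is negotiated) */
static int
set_vnet_hdr_sz(int tapfd, int sz)
{
	return ioctl(tapfd, TUNSETVNETHDRSZ, &sz);
}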
Pavel Fedin
2016-01-12 08:39:18 UTC
Permalink
Hello!
See my reply to "mem: add API to obstain memory-backed file info" for a workaround. With fixes for that and the TUNSETVNETHDRSZ issue I was able to
get traffic running over vhost-user.
With ovs or the test apps? I still have problems with ovs after this. Packets go from host to container, but not back. Here is the host-side log (I also added GPA display in order to debug the problem you pointed at):
--- cut ---
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: new virtio connection is 38
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: new device, handle is 0
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_OWNER
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_GET_FEATURES
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_FEATURES
Jan 12 11:23:32 nfv_test_x86_64 kernel: device ovs-netdev entered promiscuous mode
Jan 12 11:23:32 nfv_test_x86_64 kernel: device ovs0 entered promiscuous mode
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_MEM_TABLE
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: mapped region 0 fd:39 to:0x7f0ddea00000 sz:0x20000000 off:0x0 GPA:0x7f7159000000 align:0x200000
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: vring call idx:0 file:49
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_ADDR
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_KICK
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: vring kick idx:0 file:50
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: virtio is not ready for processing.
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: vring call idx:1 file:51
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_ADDR
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_KICK
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: vring kick idx:1 file:52
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: virtio is now ready for processing.
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_FEATURES
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_MEM_TABLE
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: mapped region 0 fd:53 to:0x7f0ddea00000 sz:0x20000000 off:0x0 GPA:0x7f7159000000 align:0x200000
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: vring call idx:0 file:39
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_ADDR
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_KICK
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: vring kick idx:0 file:49
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: virtio is now ready for processing.
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: vring call idx:1 file:50
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_ADDR
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_KICK
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: vring kick idx:1 file:51
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: virtio is now ready for processing.
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: vring call idx:3 file:52
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_ADDR
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_KICK
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: vring kick idx:3 file:56
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: virtio is not ready for processing.
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: vring call idx:5 file:57
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_ADDR
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_KICK
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: vring kick idx:5 file:58
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: virtio is not ready for processing.
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: vring call idx:7 file:59
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_ADDR
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_KICK
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: vring kick idx:7 file:60
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: virtio is not ready for processing.
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: vring call idx:9 file:61
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_ADDR
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_KICK
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: vring kick idx:9 file:62
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: virtio is not ready for processing.
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: vring call idx:11 file:63
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_ADDR
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_KICK
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: vring kick idx:11 file:64
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: virtio is not ready for processing.
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: vring call idx:13 file:65
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_ADDR
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_KICK
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: vring kick idx:13 file:66
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: virtio is not ready for processing.
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: vring call idx:15 file:67
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_ADDR
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_KICK
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: vring kick idx:15 file:68
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: virtio is not ready for processing.
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: vring call idx:17 file:69
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_ADDR
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_VRING_KICK
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: vring kick idx:17 file:70
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: virtio is not ready for processing.
Jan 12 11:23:32 nfv_test_x86_64 ovs-vswitchd[3461]: VHOST_CONFIG: read message VHOST_USER_SET_FEATURES
--- cut ---

Note that during multiqueue setup the host state reverts from "now ready for processing" to "not ready for processing". I guess this is the reason for the problem.
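A rough sketch of the readiness rule behind those messages (the struct, fields, and function here are simplified assumptions, not the vhost library's exact code): a device is announced ready only once every virtqueue has its ring mapped and both eventfds installed, so touching extra queues without mapping their rings flips the state back.

#include <stddef.h>

struct vhost_vq {	/* simplified stand-in for the vhost virtqueue */
	void *desc;	/* set by VHOST_USER_SET_VRING_ADDR */
	int kickfd;	/* set by VHOST_USER_SET_VRING_KICK */
	int callfd;	/* set by VHOST_USER_SET_VRING_CALL */
};

static int
virtio_is_ready(const struct vhost_vq *vq, unsigned int nr_vq)
{
	unsigned int i;

	for (i = 0; i < nr_vq; i++)
		if (vq[i].desc == NULL || vq[i].kickfd < 0 || vq[i].callfd < 0)
			return 0;	/* "virtio is not ready for processing" */
	return 1;
}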

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Tan, Jianfeng
2016-01-12 08:51:01 UTC
Permalink
Hi Fedin,
Post by Pavel Fedin
Hello!
See my reply to "mem: add API to obstain memory-backed file info" for a workaround. With fixes for that and the TUNSETVNETHDRSZ issue I was able to
get traffic running over vhost-user.
--- cut ---
...
--- cut ---
Note that during multiqueue setup the host state reverts from "now ready for processing" to "not ready for processing". I guess this is the reason for the problem.
Your guess makes sense because the current implementation does not support
multiple queues.

From your log, only queues 0 and 1 are "ready for processing"; the others are
"not ready for processing".

Thanks,
Jianfeng
Post by Pavel Fedin
Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Pavel Fedin
2016-01-12 10:48:45 UTC
Permalink
Hello!
Post by Tan, Jianfeng
Your guess makes sense because current implementation does not support
multi-queues.
From you log, only 0 and 1 are "ready for processing"; others are "not
ready for processing".
Yes, and if we study it even more carefully, we see that we initialize
all TX queues but only a single RX queue (#0).
After some more code browsing and comparing the two patchsets, I
figured out that the problem is caused by an inappropriate
VIRTIO_NET_F_CTRL_VQ flag. In your RFC you used a different capability
set, while in v1 you seem to have forgotten about this.
I suggest temporarily moving the hw->guest_features assignment out of
virtio_negotiate_features() into the caller, where we have
eth_dev->dev_type and can choose the right set depending on it.
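
A minimal sketch of the idea, assuming the names used elsewhere in
this thread (VIRTIO_PMD_GUEST_FEATURES, RTE_ETH_DEV_PCI are
assumptions, not taken from the actual patch):

/* In the caller, where eth_dev->dev_type is known: eth_cvio has no
 * control queue yet, so mask VIRTIO_NET_F_CTRL_VQ out for it. */
if (eth_dev->dev_type == RTE_ETH_DEV_PCI)
	hw->guest_features = VIRTIO_PMD_GUEST_FEATURES;
else
	hw->guest_features = VIRTIO_PMD_GUEST_FEATURES &
			     ~(1u << VIRTIO_NET_F_CTRL_VQ);
virtio_negotiate_features(hw);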

With all the mentioned fixes I've got ping running.
Tested-by: Pavel Fedin <***@samsung.com>

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Amit Tomer
2016-01-12 14:45:13 UTC
Permalink
Hello,

I ran l2fwd from inside Docker with the following logs, but I don't
see the port statistics get updated:

#/home/ubuntu/dpdk# sudo docker run -i -t -v
/home/ubuntu/dpdk/usvhost:/usr/src/dpdk/usvhost l4
EAL: Detected lcore 0 as core 0 on socket 0
EAL: Detected lcore 1 as core 1 on socket 0
EAL: Detected lcore 2 as core 2 on socket 0
EAL: Detected lcore 3 as core 3 on socket 0
EAL: Detected lcore 4 as core 4 on socket 0
EAL: Detected lcore 5 as core 5 on socket 0
EAL: Detected lcore 6 as core 6 on socket 0
EAL: Detected lcore 7 as core 7 on socket 0
EAL: Detected lcore 8 as core 8 on socket 0
EAL: Setting up physically contiguous memory...
EAL: TSC frequency is ~99999 KHz
EAL: Master lcore 1 is ready (tid=b5968000;cpuset=[1])
Notice: odd number of ports in portmask.
Lcore 1: RX port 0
Initializing port 0... done:
Port 0, MAC address: F6:9F:7A:47:A4:99

Checking link statusdone
Port 0 Link Up - speed 10000 Mbps - full-duplex
L2FWD: entering main loop on lcore 1
L2FWD: -- lcoreid=1 portid=0


Port statistics ====================================
Statistics for port 0 ------------------------------
Packets sent: 0
Packets received: 0
Packets dropped: 0
Aggregate statistics ===============================
Total packets sent: 0
Total packets received: 0
Total packets dropped: 0
====================================================

Host side logs after running

# ./vhost-switch -c 0x3 f -n 4 --socket-mem 2048 --huge-dir
/dev/hugepages -- -p 0x1 --dev-basename usvhost

PMD: eth_ixgbe_dev_init(): MAC: 4, PHY: 3
PMD: eth_ixgbe_dev_init(): port 1 vendorID=0x8086 deviceID=0x1528
pf queue num: 0, configured vmdq pool num: 64, each vmdq pool has 2 queues
VHOST_PORT: Max virtio devices supported: 64
VHOST_PORT: Port 0 MAC: d8 9d 67 ee 55 f0
VHOST_PORT: Skipping disabled port 1
VHOST_DATA: Procesing on Core 1 started
VHOST_CONFIG: socket created, fd:20
VHOST_CONFIG: bind to usvhost
VHOST_CONFIG: new virtio connection is 21
VHOST_CONFIG: new device, handle is 0
VHOST_CONFIG: read message VHOST_USER_GET_FEATURES
VHOST_CONFIG: read message VHOST_USER_SET_FEATURES
VHOST_CONFIG: read message VHOST_USER_SET_MEM_TABLE
VHOST_CONFIG: mapped region 0 fd:22 to 0x7f34000000 sz:0x4000000 off:0x0
VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
VHOST_CONFIG: vring call idx:0 file:23
VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM
VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
VHOST_CONFIG: read message VHOST_USER_SET_VRING_ADDR
VHOST_CONFIG: read message VHOST_USER_SET_VRING_KICK
VHOST_CONFIG: vring kick idx:0 file:24
VHOST_CONFIG: virtio isn't ready for processing.
VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
VHOST_CONFIG: vring call idx:1 file:25
VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM
VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
VHOST_CONFIG: read message VHOST_USER_SET_VRING_ADDR
VHOST_CONFIG: read message VHOST_USER_SET_VRING_KICK
VHOST_CONFIG: vring kick idx:1 file:26
VHOST_CONFIG: virtio is now ready for processing.
VHOST_DATA: (0) Device has been added to data core 1

Could anyone please point out how this can be tested further (how can
traffic be sent between the host and the container)?

Thanks,
Amit.
Post by Pavel Fedin
Hello!
Post by Tan, Jianfeng
Your guess makes sense because current implementation does not support
multi-queues.
From you log, only 0 and 1 are "ready for processing"; others are "not
ready for processing".
Yes, and if study it even more carefully, we see that we initialize all tx queues but only a single rx queue (#0).
After some more code browsing and comparing the two patchsets i figured out that the problem is caused by inappropriate VIRTIO_NET_F_CTRL_VQ flag. In your RFC you used different capability set, while in v1 you seem to have forgotten about this.
I suggest to temporarily move hw->guest_features assignment out of virtio_negotiate_features() into the caller, where we have eth_dev->dev_type, and can choose the right set depending on it.
With all mentioned fixes i've got the ping running.
Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Pavel Fedin
2016-01-12 14:50:29 UTC
Permalink
Hello!
Post by Amit Tomer
Could anyone please point out, how it can be tested further(how can
traffic be sent across host and container) ?
Have you applied all three fixes discussed here?

Kind regards,
Pavel Fedin
Senior Engineer
Samsung Electronics Research center Russia
Amit Tomer
2016-01-12 14:58:10 UTC
Permalink
Hello,
Post by Pavel Fedin
Have you applied all three fixes discussed here?
I am running it with only the RFC patches applied, with "--no-huge" in l2fwd.

Thanks
Amit.
Tan, Jianfeng
2016-01-12 14:53:11 UTC
Permalink
Hello!
Post by Amit Tomer
Hello,
But, don't see Port statistics gets updated ?
vhost-switch judges whether a virtio device is ready for processing
only after receiving a packet from that virtio device. So you'd better
construct a packet and send it out first in l2fwd.

Thanks,
Jianfeng
Amit Tomer
2016-01-12 15:11:31 UTC
Permalink
Hello,
In vhost-switch, it judges if a virtio device is ready for processing after
receiving
a pkt from virtio device. So you'd better construct a pkt, and send it out
firstly
in l2fwd.
I tried to ping the socket interface from the host for the same
purpose, but it didn't work.

Could you please suggest some other approach for achieving the same
(how a packet can be sent out to l2fwd)?

Also, before trying this, I have verified that vhost-switch works
fine with testpmd.

Thanks,
Amit.
Tan, Jianfeng
2016-01-12 16:18:34 UTC
Permalink
Hello,
Post by Amit Tomer
Hello,
In vhost-switch, it judges if a virtio device is ready for processing after
receiving
a pkt from virtio device. So you'd better construct a pkt, and send it out
firstly
in l2fwd.
I tried to ping the socket interface from host for the same purpose
but it didn't work.
Could you please suggest some other approach for achieving same(how
pkt can be sent out to l2fwd)?
Also, before trying this, I have verified that vhost-switch is working
ok with testpmd .
Thanks,
Amit.
You can use the patch below for l2fwd to send out an ARP packet when
it gets started.

diff --git a/examples/l2fwd/main.c b/examples/l2fwd/main.c
index 720fd5a..572b1ac 100644
--- a/examples/l2fwd/main.c
+++ b/examples/l2fwd/main.c
@@ -69,6 +69,8 @@
#include <rte_mempool.h>
#include <rte_mbuf.h>

+#define SEND_ARP
+
#define RTE_LOGTYPE_L2FWD RTE_LOGTYPE_USER1

#define NB_MBUF 8192
@@ -185,6 +187,53 @@ print_stats(void)
printf("\n====================================================\n");
}

+#ifdef SEND_ARP
+static void
+dpdk_send_arp(int portid, struct rte_mempool *mp)
+{
+ /*
+ * len = 14 + 46
+ * ARP, Request who-has 10.0.0.1 tell 10.0.0.2, length 46
+ */
+ static const uint8_t arp_request[] = {
+ /*0x0000:*/ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xec, 0xa8,
+             0x6b, 0xfd, 0x02, 0x29, 0x08, 0x06, 0x00, 0x01,
+ /*0x0010:*/ 0x08, 0x00, 0x06, 0x04, 0x00, 0x01, 0xec, 0xa8,
+             0x6b, 0xfd, 0x02, 0x29, 0x0a, 0x00, 0x00, 0x01,
+ /*0x0020:*/ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x0a, 0x00,
+             0x00, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ /*0x0030:*/ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+             0x00, 0x00, 0x00, 0x00
+ };
+ int ret;
+ struct rte_mbuf *m;
+ struct ether_addr mac_addr;
+ int pkt_len = sizeof(arp_request) - 1;
+
+ m = rte_pktmbuf_alloc(mp);
+
+ memcpy((void *)((uint64_t)m->buf_addr + m->data_off), arp_request, pkt_len);
+ rte_pktmbuf_pkt_len(m) = pkt_len;
+ rte_pktmbuf_data_len(m) = pkt_len;
+
+ rte_eth_macaddr_get(portid, &mac_addr);
+ memcpy((void *)((uint64_t)m->buf_addr + m->data_off + 6), &mac_addr, 6);
+
+ ret = rte_eth_tx_burst(portid, 0, &m, 1);
+ if (ret == 1) {
+ printf("arp sent: ok\n");
+ printf("%02x:%02x:%02x:%02x:%02x:%02x\n",
+ mac_addr.addr_bytes[0],
+ mac_addr.addr_bytes[1],
+ mac_addr.addr_bytes[2],
+ mac_addr.addr_bytes[3],
+ mac_addr.addr_bytes[4],
+ mac_addr.addr_bytes[5]);
+ } else {
+ printf("arp sent: fail\n");
+ }
+
+ rte_pktmbuf_free(m);
+}
+#endif
+
+
/* Send the burst of packets on an output interface */
static int
l2fwd_send_burst(struct lcore_queue_conf *qconf, unsigned n, uint8_t port)
@@ -281,6 +330,9 @@ l2fwd_main_loop(void)
portid = qconf->rx_port_list[i];
RTE_LOG(INFO, L2FWD, " -- lcoreid=%u portid=%u\n", lcore_id, portid);
+#ifdef SEND_ARP
+ dpdk_send_arp(portid, l2fwd_pktmbuf_pool);
+#endif
}

while (1) {
Amit Tomer
2016-01-13 15:00:50 UTC
Permalink
Hello,
Post by Tan, Jianfeng
You can use below patch for l2fwd to send out an arp packet when it gets
started.
I tried to send out an ARP packet using this patch, but the buffer
allocation for the ARP packet itself fails:

m = rte_pktmbuf_alloc(mp);

It returns a NULL value.

Thanks,
Amit.
Tan, Jianfeng
2016-01-13 18:41:03 UTC
Permalink
Hi Amit,
Post by Amit Tomer
Hello,
Post by Tan, Jianfeng
You can use below patch for l2fwd to send out an arp packet when it gets
started.
I tried to send out arp packet using this patch but buffer allocation
m = rte_pktmbuf_alloc(mp);
Return a NULL Value.
Can you share how you start this l2fwd program?

Thanks,
Jianfeng
Post by Amit Tomer
Thanks,
Amit.
Amit Tomer
2016-01-14 09:34:00 UTC
Permalink
Hello,
Post by Tan, Jianfeng
Can you send out how you start this l2fwd program?
This is how I run the l2fwd program:

CMD ["/usr/src/dpdk/examples/l2fwd/build/l2fwd", "-c", "0x3", "-n",
"4","--no-pci",
,"--no-huge","--vdev=eth_cvio0,queue_num=256,rx=1,tx=1,cq=0,path=/usr/src/dpdk/usvhost",
"--", "-p", "0x1"]

I tried passing "-m 1024" to it, but it causes l2fwd to be killed
even before it can connect to the usvhost socket.

Do I need to create hugepages from inside the Docker container to
make use of hugepages?

Thanks,
Amit.
Tan, Jianfeng
2016-01-14 11:41:32 UTC
Permalink
Hi Amit,
Post by Amit Tomer
Hello,
Post by Tan, Jianfeng
Can you send out how you start this l2fwd program?
This is how, I run l2fwd program.
CMD ["/usr/src/dpdk/examples/l2fwd/build/l2fwd", "-c", "0x3", "-n",
"4","--no-pci",
,"--no-huge","--vdev=eth_cvio0,queue_num=256,rx=1,tx=1,cq=0,path=/usr/src/dpdk/usvhost",
"--", "-p", "0x1"]
In this way, you can only get 64 MB of memory. I believe that's too
small to create the l2fwd_pktmbuf_pool in l2fwd.
Post by Amit Tomer
I tried passing "-m 1024" to it but It causes l2fwd killed even before
it could connect to usvhost socket.
In my patch, when --no-huge is specified, I changed the previous
anonymous mmap into file-backed memory in /dev/shm. Usually, Docker
mounts a 64 MB tmpfs there, so you cannot use -m 1024. If you want to
do that, use -v to substitute the 64 MB tmpfs with a bigger tmpfs; see
the example below.
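
For instance, something along these lines on the host (the path and
size are illustrative):

$: mount -t tmpfs -o size=1024M tmpfs /mnt/dpdk_shm
$: docker run -i -t -v /mnt/dpdk_shm:/dev/shm \
     -v <path to vhost unix socket>:/var/run/usvhost dpdk-app-l2fwd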
Post by Amit Tomer
Do I need to create Hugepages from Inside Docker container to make use
of Hugepages?
Not necessary. But if you want to use hugepages inside Docker, use -v
option to map a hugetlbfs into containers.

Most importantly, you have indeed uncovered a bug here. The current
implementation cannot work with tmpfs, because it lacks an ftruncate()
between open() and mmap(). It turns out that although mmap() succeeds,
the memory cannot be touched. However, this is not a problem for
hugetlbfs; I don't know why they differ that way. In short, if you
want to use --no-huge, please add ftruncate(); I'll fix this in the
next version.
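
A minimal sketch of the fix being described, assuming a /dev/shm
backing file and a caller-supplied size (illustrative only, not the
actual patch):

#define _GNU_SOURCE	/* for MAP_POPULATE */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

static void *
map_shm_file(const char *path, size_t mem_size)
{
	void *addr;
	int fd = open(path, O_CREAT | O_RDWR, 0600);

	if (fd < 0)
		return NULL;
	/* The missing step: without sizing the file first, writes into
	 * the tmpfs-backed mapping fault even though mmap() succeeds. */
	if (ftruncate(fd, (off_t)mem_size) < 0) {
		close(fd);
		return NULL;
	}
	addr = mmap(NULL, mem_size, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_POPULATE, fd, 0);
	close(fd);
	return addr == MAP_FAILED ? NULL : addr;
}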

Thanks,
Jianfeng
Post by Amit Tomer
Thanks,
Amit.
Amit Tomer
2016-01-14 12:03:52 UTC
Permalink
Hello,
Not necessary. But if you want to use hugepages inside Docker, use -v option
to map a hugetlbfs into containers.
I modified the Docker command line in order to make use of hugetlbfs:

CMD ["/usr/src/dpdk/examples/l2fwd/build/l2fwd", "-c", "0x3", "-n",
"4","--no-pci", "--socket-mem","512",
"--vdev=eth_cvio0,queue_num=256,rx=1,tx=1,cq=0,path=/var/run/usvhost",
"--", "-p", "0x1"]

Then I run Docker:

docker run -i -t --privileged -v /dev/hugepages:/dev/hugepages -v
/home/ubuntu/backup/usvhost:/var/run/usvhost l6

But this is what I see:

EAL: Support maximum 128 logical core(s) by configuration.
EAL: Detected 48 lcore(s)
EAL: Setting up physically contiguous memory...
EAL: Failed to find phys addr for 2 MB pages
PANIC in rte_eal_init():
Cannot init memory
1: [/usr/src/dpdk/examples/l2fwd/build/l2fwd(rte_dump_stack+0x20) [0x48ea78]]

This is from Host:

# mount | grep hugetlbfs
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime)
none on /dev/hugepages type hugetlbfs (rw,relatime)

#cat /proc/meminfo | grep Huge
AnonHugePages: 548864 kB
HugePages_Total: 4096
HugePages_Free: 1024
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB

What is it I'm doing wrong here?

Thanks,
Amit
Tan, Jianfeng
2016-01-15 06:39:26 UTC
Permalink
Hi Amit,
Post by Amit Tomer
Hello,
Not necessary. But if you want to use hugepages inside Docker, use -v option
to map a hugetlbfs into containers.
CMD ["/usr/src/dpdk/examples/l2fwd/build/l2fwd", "-c", "0x3", "-n",
"4","--no-pci", "--socket-mem","512",
"--vdev=eth_cvio0,queue_num=256,rx=1,tx=1,cq=0,path=/var/run/usvhost",
"--", "-p", "0x1"]
For this case, please use the --single-file option, because otherwise
many more than 8 fds are created, which is more than vhost-user's
sendmsg() can handle.
Post by Amit Tomer
docker run -i -t --privileged -v /dev/hugepages:/dev/hugepages -v
/home/ubuntu/backup/usvhost:/var/run/usvhost l6
EAL: Support maximum 128 logical core(s) by configuration.
EAL: Detected 48 lcore(s)
EAL: Setting up physically contiguous memory...
EAL: Failed to find phys addr for 2 MB pages
Cannot init memory
1: [/usr/src/dpdk/examples/l2fwd/build/l2fwd(rte_dump_stack+0x20) [0x48ea78]]
From the log, it's caused by the fact that it still cannot open
/proc/self/pagemap. But that's strange, since you already specify
--privileged.

Thanks,
Jianfeng
Post by Amit Tomer
# mount | grep hugetlbfs
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime)
none on /dev/hugepages type hugetlbfs (rw,relatime)
#cat /proc/meminfo | grep Huge
AnonHugePages: 548864 kB
HugePages_Total: 4096
HugePages_Free: 1024
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
What is it, I'm doing wrong here?
Thanks,
Amit
Amit Tomer
2016-01-20 15:19:50 UTC
Permalink
Hello,
For this case, please use --single-file option because it creates much more
than 8 fds, which can be handled by vhost-user sendmsg().
Thanks, I'm able to verify it by sending an ARP packet from the
container to the host on arm64. But sometimes I do see the following
message while running l2fwd in the container (pointed out by Rich).

EAL: Master lcore 0 is ready (tid=8a7a3000;cpuset=[0])
EAL: lcore 1 is ready (tid=89cdf050;cpuset=[1])
Notice: odd number of ports in portmask.
Lcore 0: RX port 0
Initializing port 0... PANIC in kick_all_vq():
TUNSETVNETHDRSZ failed: Inappropriate ioctl for device

How could it be avoided?

Thanks,
Amit.
Tan, Jianfeng
2016-01-22 06:04:41 UTC
Permalink
Hi Amit,
Post by Amit Tomer
Hello,
For this case, please use --single-file option because it creates much more
than 8 fds, which can be handled by vhost-user sendmsg().
Thanks, I'm able to verify it by sending ARP packet from container to
host on arm64. But sometimes, I do see following message while running
l2fwd in container(pointed by Rich).
EAL: Master lcore 0 is ready (tid=8a7a3000;cpuset=[0])
EAL: lcore 1 is ready (tid=89cdf050;cpuset=[1])
Notice: odd number of ports in portmask.
Lcore 0: RX port 0
TUNSETVNETHDRSZ failed: Inappropriate ioctl for device
How it could be avoided?
Thanks,
Amit.
Thanks for pointing out this bug. It's actually caused by a mistake
of mine, so vhost-user cannot work well.
The change below can help start vhost-user.

diff --git a/drivers/net/virtio/vhost.c b/drivers/net/virtio/vhost.c
index e423e02..dbca374 100644
--- a/drivers/net/virtio/vhost.c
+++ b/drivers/net/virtio/vhost.c
@@ -483,8 +483,9 @@ static void kick_all_vq(struct virtio_hw *hw)
uint64_t features = hw->guest_features;
features &= ~(1ull << VIRTIO_NET_F_MAC);
vhost_call(hw, VHOST_MSG_SET_FEATURES, &features);
- if (ioctl(hw->backfd, TUNSETVNETHDRSZ, &hw->vtnet_hdr_size) == -1)
- rte_panic("TUNSETVNETHDRSZ failed: %s\n", strerror(errno));
+ if (hw->type == VHOST_KERNEL)
+         if (ioctl(hw->backfd, TUNSETVNETHDRSZ, &hw->vtnet_hdr_size) == -1)
+                 rte_panic("TUNSETVNETHDRSZ failed: %s\n", strerror(errno));
PMD_DRV_LOG(INFO, "set features:%"PRIx64"\n", features);


Thanks,
Jianfeng
Tetsuya Mukawa
2016-01-12 05:36:25 UTC
Permalink
Post by Jianfeng Tan
This patchset is to provide high performance networking interface (virtio)
for container-based DPDK applications. The way of starting DPDK apps in
containers with ownership of NIC devices exclusively is beyond the scope.
The basic idea here is to present a new virtual device (named eth_cvio),
which can be discovered and initialized in container-based DPDK apps using
rte_eal_init(). To minimize the change, we reuse already-existing virtio
frontend driver code (driver/net/virtio/).
Compared to QEMU/VM case, virtio device framework (translates I/O port r/w
operations into unix socket/cuse protocol, which is originally provided in
QEMU), is integrated in virtio frontend driver. So this converged driver
actually plays the role of original frontend driver and the role of QEMU
device framework.
The major difference lies in how to calculate relative address for vhost.
The principle of virtio is that: based on one or multiple shared memory
segments, vhost maintains a reference system with the base addresses and
length for each segment so that an address from VM comes (usually GPA,
Guest Physical Address) can be translated into vhost-recognizable address
(named VVA, Vhost Virtual Address). To decrease the overhead of address
translation, we should maintain as few segments as possible. In VM's case,
GPA is always locally continuous. In container's case, CVA (Container
a. when set_base_addr, CVA address is used;
b. when preparing RX's descriptors, CVA address is used;
c. when transmitting packets, CVA is filled in TX's descriptors;
d. in TX and CQ's header, CVA is used.
How to share memory? In VM's case, qemu always shares all physical layout
to backend. But it's not feasible for a container, as a process, to share
all virtual memory regions to backend. So only specified virtual memory
regions (with type of shared) are sent to backend. It's a limitation that
only addresses in these areas can be used to transmit or receive packets.
Known issues
a. When used with vhost-net, root privilege is required to create tap
device inside.
b. Control queue and multi-queue are not supported yet.
c. When --single-file option is used, socket_id of the memory may be
wrong. (Use "numactl -N x -m x" to work around this for now)
How to use?
a. Apply this patchset.
$: make config RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
$: make install RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
$: make -C examples/l2fwd RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
$: make -C examples/vhost RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
c. To build a docker image using Dockerfile below.
$: cat ./Dockerfile
FROM ubuntu:latest
WORKDIR /usr/src/dpdk
COPY . /usr/src/dpdk
ENV PATH "$PATH:/usr/src/dpdk/examples/l2fwd/build/"
$: docker build -t dpdk-app-l2fwd .
d. Used with vhost-user
$: ./examples/vhost/build/vhost-switch -c 3 -n 4 \
--socket-mem 1024,1024 -- -p 0x1 --stats 1
$: docker run -i -t -v <path_to_vhost_unix_socket>:/var/run/usvhost \
-v /dev/hugepages:/dev/hugepages \
dpdk-app-l2fwd l2fwd -c 0x4 -n 4 -m 1024 --no-pci \
--vdev=eth_cvio0,path=/var/run/usvhost -- -p 0x1
f. Used with vhost-net
$: modprobe vhost
$: modprobe vhost-net
$: docker run -i -t --privileged \
-v /dev/vhost-net:/dev/vhost-net \
-v /dev/net/tun:/dev/net/tun \
-v /dev/hugepages:/dev/hugepages \
dpdk-app-l2fwd l2fwd -c 0x4 -n 4 -m 1024 --no-pci \
--vdev=eth_cvio0,path=/dev/vhost-net -- -p 0x1
By the way, it's not necessary to run in a container.
mem: add --single-file to create single mem-backed file
mem: add API to obstain memory-backed file info
virtio/vdev: add ways to interact with vhost
virtio/vdev: add a new vdev named eth_cvio
config/common_linuxapp | 5 +
drivers/net/virtio/Makefile | 4 +
drivers/net/virtio/vhost.c | 734 +++++++++++++++++++++++++++++
drivers/net/virtio/vhost.h | 192 ++++++++
drivers/net/virtio/virtio_ethdev.c | 338 ++++++++++---
drivers/net/virtio/virtio_ethdev.h | 4 +
drivers/net/virtio/virtio_pci.h | 52 +-
drivers/net/virtio/virtio_rxtx.c | 11 +-
drivers/net/virtio/virtio_rxtx_simple.c | 14 +-
drivers/net/virtio/virtqueue.h | 13 +-
lib/librte_eal/common/eal_common_options.c | 17 +
lib/librte_eal/common/eal_internal_cfg.h | 1 +
lib/librte_eal/common/eal_options.h | 2 +
lib/librte_eal/common/include/rte_memory.h | 16 +
lib/librte_eal/linuxapp/eal/eal_memory.c | 82 +++-
15 files changed, 1392 insertions(+), 93 deletions(-)
create mode 100644 drivers/net/virtio/vhost.c
create mode 100644 drivers/net/virtio/vhost.h
Hi Jianfeng and Xie,

I guess my implementation and yours have a lot of common code, so I will
try to rebase my patch on yours.

BTW, one thing I need to change in your memory allocation approach is
that the mmapped address should be under 44 bits (32 + PAGE_SHIFT) to
work with my patch, because the VIRTIO_PCI_QUEUE_PFN register only
accepts such addresses.
(I may need to add one more EAL parameter like "--mmap-under <address>")
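
For reference, the arithmetic behind the 44-bit limit: the legacy
VIRTIO_PCI_QUEUE_PFN register stores a 32-bit page frame number, and
VIRTIO_PCI_QUEUE_ADDR_SHIFT is 12, so the highest addressable byte is
just under 2^(32+12) = 2^44:

/* ring address -> 32-bit PFN written to VIRTIO_PCI_QUEUE_PFN */
uint32_t pfn = (uint32_t)(ring_mem >> VIRTIO_PCI_QUEUE_ADDR_SHIFT);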

Thanks,
Tetsuya
Tan, Jianfeng
2016-01-12 05:46:51 UTC
Permalink
Hi Tetsuya,
Post by Tetsuya Mukawa
Hi Jianfeng and Xie,
I guess my implementation and yours have a lot of common code, so I will
try to rebase my patch on yours.
We also think so. And before you rebase your code, I think we can rely
on Yuanhan's
struct virtio_pci_ops to make the code structure brief and clear, as
discussed in your
patch's thread, i.e., we both rebase our code according to Yuanhan's
code. Is that OK?
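
The rough shape of that abstraction, for reference (the member list is
approximate; the actual struct in Yuanhan's series may differ):

struct virtio_pci_ops {
	void	 (*read_dev_cfg)(struct virtio_hw *hw, size_t offset,
				 void *dst, int len);
	void	 (*write_dev_cfg)(struct virtio_hw *hw, size_t offset,
				  const void *src, int len);
	uint8_t	 (*get_status)(struct virtio_hw *hw);
	void	 (*set_status)(struct virtio_hw *hw, uint8_t status);
	uint64_t (*get_features)(struct virtio_hw *hw);
	void	 (*set_features)(struct virtio_hw *hw, uint64_t features);
	int	 (*setup_queue)(struct virtio_hw *hw, struct virtqueue *vq);
	void	 (*notify_queue)(struct virtio_hw *hw, struct virtqueue *vq);
};

With such a vtable, legacy PCI, modern PCI, and (with these patches)
an in-process vhost backend can each supply their own implementation.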
Post by Tetsuya Mukawa
BTW, one thing I need to change your memory allocation way is that
mmaped address should be under 44bit(32 + PAGE_SHIFT) to work with my patch.
This is because VIRTIO_PCI_QUEUE_PFN register only accepts such address.
(I may need to add one more EAL parameter like "--mmap-under <address>")
It makes sense.

Thanks,
Jianfeng
Post by Tetsuya Mukawa
Thanks,
Tetsuya
Tetsuya Mukawa
2016-01-12 06:01:01 UTC
Permalink
Post by Tan, Jianfeng
Hi Tetsuya,
Post by Tetsuya Mukawa
Hi Jianfeng and Xie,
I guess my implementation and yours have a lot of common code, so I will
try to rebase my patch on yours.
We also think so. And before you rebase your code, I think we can rely
on Yuanhan's
struct virtio_pci_ops to make the code structure brief and clear, as
discussed in your
patch's thread, i.e., we both rebase our code according to Yuanhan's
code. Is that OK?
Yes, I agree with it.

Thanks,
Tetsuya
Post by Tan, Jianfeng
Post by Tetsuya Mukawa
BTW, one thing I need to change your memory allocation way is that
mmaped address should be under 44bit(32 + PAGE_SHIFT) to work with my patch.
This is because VIRTIO_PCI_QUEUE_PFN register only accepts such address.
(I may need to add one more EAL parameter like "--mmap-under <address>")
It makes sense.
Thanks,
Jianfeng
Post by Tetsuya Mukawa
Thanks,
Tetsuya
Yuanhan Liu
2016-01-12 06:14:34 UTC
Permalink
Post by Tetsuya Mukawa
Post by Tan, Jianfeng
Hi Tetsuya,
Post by Tetsuya Mukawa
Hi Jianfeng and Xie,
I guess my implementation and yours have a lot of common code, so I will
try to rebase my patch on yours.
We also think so. And before you rebase your code, I think we can rely
on Yuanhan's
struct virtio_pci_ops to make the code structure brief and clear, as
discussed in your
patch's thread, i.e., we both rebase our code according to Yuanhan's
code. Is that OK?
Yes, I agree with it.
I will send v2 out today, and hopefully someone will ACK and test it
soon. After that, I'm also hoping Thomas could do a quick merge then.

--yliu
Tetsuya Mukawa
2016-01-12 06:26:47 UTC
Permalink
Post by Yuanhan Liu
Post by Tetsuya Mukawa
Post by Tan, Jianfeng
Hi Tetsuya,
Post by Tetsuya Mukawa
Hi Jianfeng and Xie,
I guess my implementation and yours have a lot of common code, so I will
try to rebase my patch on yours.
We also think so. And before you rebase your code, I think we can rely
on Yuanhan's
struct virtio_pci_ops to make the code structure brief and clear, as
discussed in your
patch's thread, i.e., we both rebase our code according to Yuanhan's
code. Is that OK?
Yes, I agree with it.
I will send v2 out today, and hopefully someone will ACK and test it
soon. After that, I'm also hoping Thomas could do a quick merge then.
--yliu
Hi Yuanhan,

Thanks, I will review and test it also.

Tetsuya
Yuanhan Liu
2016-01-12 06:29:54 UTC
Permalink
Post by Tan, Jianfeng
Post by Yuanhan Liu
Post by Tetsuya Mukawa
Post by Tan, Jianfeng
Hi Tetsuya,
Post by Tetsuya Mukawa
Hi Jianfeng and Xie,
I guess my implementation and yours have a lot of common code, so I will
try to rebase my patch on yours.
We also think so. And before you rebase your code, I think we can rely
on Yuanhan's
struct virtio_pci_ops to make the code structure brief and clear, as
discussed in your
patch's thread, i.e., we both rebase our code according to Yuanhan's
code. Is that OK?
Yes, I agree with it.
I will send v2 out today, and hopefully someone will ACK and test it
soon. After that, I'm also hoping Thomas could do a quick merge then.
--yliu
Hi Yuanhan,
Thanks, I will review and test it also.
Appreciate that!

--yliu
Xie, Huawei
2016-01-20 03:48:07 UTC
Permalink
Post by Tetsuya Mukawa
Hi Jianfeng and Xie,
I guess my implementation and yours have a lot of common code, so I will
try to rebase my patch on yours.
BTW, one thing I need to change your memory allocation way is that
mmaped address should be under 44bit(32 + PAGE_SHIFT) to work with my patch.
This is because VIRTIO_PCI_QUEUE_PFN register only accepts such address.
(I may need to add one more EAL parameter like "--mmap-under <address>")
I believe it is OK to mmap under 44 bits, but you'd better check the
user-space address-space layout.
Post by Tetsuya Mukawa
Thanks,
Tetsuya
Qiu, Michael
2016-01-26 06:02:03 UTC
Permalink
Post by Jianfeng Tan
This patchset is to provide high performance networking interface (virtio)
for container-based DPDK applications. The way of starting DPDK apps in
containers with ownership of NIC devices exclusively is beyond the scope.
The basic idea here is to present a new virtual device (named eth_cvio),
which can be discovered and initialized in container-based DPDK apps using
rte_eal_init(). To minimize the change, we reuse already-existing virtio
frontend driver code (driver/net/virtio/).
Compared to QEMU/VM case, virtio device framework (translates I/O port r/w
operations into unix socket/cuse protocol, which is originally provided in
QEMU), is integrated in virtio frontend driver. So this converged driver
actually plays the role of original frontend driver and the role of QEMU
device framework.
The major difference lies in how to calculate relative address for vhost.
The principle of virtio is that: based on one or multiple shared memory
segments, vhost maintains a reference system with the base addresses and
length for each segment so that an address from VM comes (usually GPA,
Guest Physical Address) can be translated into vhost-recognizable address
(named VVA, Vhost Virtual Address). To decrease the overhead of address
translation, we should maintain as few segments as possible. In VM's case,
GPA is always locally continuous. In container's case, CVA (Container
a. when set_base_addr, CVA address is used;
b. when preparing RX's descriptors, CVA address is used;
c. when transmitting packets, CVA is filled in TX's descriptors;
d. in TX and CQ's header, CVA is used.
How to share memory? In VM's case, qemu always shares all physical layout
to backend. But it's not feasible for a container, as a process, to share
all virtual memory regions to backend. So only specified virtual memory
regions (with type of shared) are sent to backend. It's a limitation that
only addresses in these areas can be used to transmit or receive packets.
Known issues
a. When used with vhost-net, root privilege is required to create tap
device inside.
b. Control queue and multi-queue are not supported yet.
c. When --single-file option is used, socket_id of the memory may be
wrong. (Use "numactl -N x -m x" to work around this for now)
How to use?
a. Apply this patchset.
$: make config RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
$: make install RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
$: make -C examples/l2fwd RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
$: make -C examples/vhost RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
c. To build a docker image using Dockerfile below.
$: cat ./Dockerfile
FROM ubuntu:latest
WORKDIR /usr/src/dpdk
COPY . /usr/src/dpdk
ENV PATH "$PATH:/usr/src/dpdk/examples/l2fwd/build/"
$: docker build -t dpdk-app-l2fwd .
d. Used with vhost-user
$: ./examples/vhost/build/vhost-switch -c 3 -n 4 \
--socket-mem 1024,1024 -- -p 0x1 --stats 1
$: docker run -i -t -v <path_to_vhost_unix_socket>:/var/run/usvhost \
-v /dev/hugepages:/dev/hugepages \
dpdk-app-l2fwd l2fwd -c 0x4 -n 4 -m 1024 --no-pci \
--vdev=eth_cvio0,path=/var/run/usvhost -- -p 0x1
f. Used with vhost-net
$: modprobe vhost
$: modprobe vhost-net
$: docker run -i -t --privileged \
-v /dev/vhost-net:/dev/vhost-net \
-v /dev/net/tun:/dev/net/tun \
-v /dev/hugepages:/dev/hugepages \
dpdk-app-l2fwd l2fwd -c 0x4 -n 4 -m 1024 --no-pci \
--vdev=eth_cvio0,path=/dev/vhost-net -- -p 0x1
We'd better add an ifname option, like
--vdev=eth_cvio0,path=/dev/vhost-net,ifname=tap0, so that the user can
add the tap to a bridge first; see the example below.
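
For example, with standard Linux tooling on the host (shown for
illustration):

$: ip link add name br0 type bridge
$: ip link set tap0 master br0
$: ip link set br0 up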

Thanks,
Michael
Post by Jianfeng Tan
By the way, it's not necessary to run in a container.
mem: add --single-file to create single mem-backed file
mem: add API to obstain memory-backed file info
virtio/vdev: add ways to interact with vhost
virtio/vdev: add a new vdev named eth_cvio
config/common_linuxapp | 5 +
drivers/net/virtio/Makefile | 4 +
drivers/net/virtio/vhost.c | 734 +++++++++++++++++++++++++++++
drivers/net/virtio/vhost.h | 192 ++++++++
drivers/net/virtio/virtio_ethdev.c | 338 ++++++++++---
drivers/net/virtio/virtio_ethdev.h | 4 +
drivers/net/virtio/virtio_pci.h | 52 +-
drivers/net/virtio/virtio_rxtx.c | 11 +-
drivers/net/virtio/virtio_rxtx_simple.c | 14 +-
drivers/net/virtio/virtqueue.h | 13 +-
lib/librte_eal/common/eal_common_options.c | 17 +
lib/librte_eal/common/eal_internal_cfg.h | 1 +
lib/librte_eal/common/eal_options.h | 2 +
lib/librte_eal/common/include/rte_memory.h | 16 +
lib/librte_eal/linuxapp/eal/eal_memory.c | 82 +++-
15 files changed, 1392 insertions(+), 93 deletions(-)
create mode 100644 drivers/net/virtio/vhost.c
create mode 100644 drivers/net/virtio/vhost.h
Tan, Jianfeng
2016-01-26 06:09:13 UTC
Permalink
Hi Michael,
...
Post by Qiu, Michael
Post by Jianfeng Tan
f. Used with vhost-net
$: modprobe vhost
$: modprobe vhost-net
$: docker run -i -t --privileged \
-v /dev/vhost-net:/dev/vhost-net \
-v /dev/net/tun:/dev/net/tun \
-v /dev/hugepages:/dev/hugepages \
dpdk-app-l2fwd l2fwd -c 0x4 -n 4 -m 1024 --no-pci \
--vdev=eth_cvio0,path=/dev/vhost-net -- -p 0x1
We'd better add a ifname, like
--vdev=eth_cvio0,path=/dev/vhost-net,ifname=tap0, so that user could add
the tap to the bridge first.
That's an awesome suggestion.

Thanks,
Jianfeng
Post by Qiu, Michael
Thanks,
Michael
Jianfeng Tan
2016-02-05 11:20:23 UTC
Permalink
v1->v2:
- Rebase on the patchset of virtio 1.0 support.
- Fix failure to create non-hugepage memory.
- Fix wrong size of memory region when "single-file" is used.
- Fix setting of offset in virtqueue to use virtual address.
- Fix setting TUNSETVNETHDRSZ in vhost-user's branch.
- Add mac option to specify the mac address of this virtual device.
- Update doc.

This patchset is to provide high performance networking interface (virtio)
for container-based DPDK applications. The way of starting DPDK apps in
containers with ownership of NIC devices exclusively is beyond the scope.
The basic idea here is to present a new virtual device (named eth_cvio),
which can be discovered and initialized in container-based DPDK apps using
rte_eal_init(). To minimize the change, we reuse already-existing virtio
frontend driver code (driver/net/virtio/).

Compared to QEMU/VM case, virtio device framework (translates I/O port r/w
operations into unix socket/cuse protocol, which is originally provided in
QEMU), is integrated in virtio frontend driver. So this converged driver
actually plays the role of original frontend driver and the role of QEMU
device framework.

The major difference lies in how to calculate relative address for vhost.
The principle of virtio is that: based on one or multiple shared memory
segments, vhost maintains a reference system with the base addresses and
length for each segment so that an address from VM comes (usually GPA,
Guest Physical Address) can be translated into vhost-recognizable address
(named VVA, Vhost Virtual Address). To decrease the overhead of address
translation, we should maintain as few segments as possible. In VM's case,
GPA is always locally continuous. In container's case, CVA (Container
Virtual Address) can be used. Specifically:
a. when set_base_addr, CVA address is used;
b. when preparing RX's descriptors, CVA address is used;
c. when transmitting packets, CVA is filled in TX's descriptors;
d. in TX and CQ's header, CVA is used.

How to share memory? In VM's case, qemu always shares all physical layout
to backend. But it's not feasible for a container, as a process, to share
all virtual memory regions to backend. So only specified virtual memory
regions (with type of shared) are sent to backend. It's a limitation that
only addresses in these areas can be used to transmit or receive packets.

Known issues

a. When used with vhost-net, root privilege is required to create tap
device inside.
b. Control queue and multi-queue are not supported yet.
c. When --single-file option is used, socket_id of the memory may be
wrong. (Use "numactl -N x -m x" to work around this for now)

How to use?

a. Apply this patchset.

b. To compile container apps:
$: make config RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
$: make install RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
$: make -C examples/l2fwd RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
$: make -C examples/vhost RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc

c. To build a docker image using Dockerfile below.
$: cat ./Dockerfile
FROM ubuntu:latest
WORKDIR /usr/src/dpdk
COPY . /usr/src/dpdk
ENV PATH "$PATH:/usr/src/dpdk/examples/l2fwd/build/"
$: docker build -t dpdk-app-l2fwd .

d. Used with vhost-user
$: ./examples/vhost/build/vhost-switch -c 3 -n 4 \
--socket-mem 1024,1024 -- -p 0x1 --stats 1
$: docker run -i -t -v <path_to_vhost_unix_socket>:/var/run/usvhost \
-v /dev/hugepages:/dev/hugepages \
dpdk-app-l2fwd l2fwd -c 0x4 -n 4 -m 1024 --no-pci \
--vdev=eth_cvio0,path=/var/run/usvhost -- -p 0x1

e. Used with vhost-net
$: modprobe vhost
$: modprobe vhost-net
$: docker run -i -t --privileged \
-v /dev/vhost-net:/dev/vhost-net \
-v /dev/net/tun:/dev/net/tun \
-v /dev/hugepages:/dev/hugepages \
dpdk-app-l2fwd l2fwd -c 0x4 -n 4 -m 1024 --no-pci \
--vdev=eth_cvio0,path=/dev/vhost-net -- -p 0x1

By the way, it's not necessary to run in a container.

Signed-off-by: Huawei Xie <***@intel.com>
Signed-off-by: Jianfeng Tan <***@intel.com>

Jianfeng Tan (5):
mem: add --single-file to create single mem-backed file
mem: add API to obtain memory-backed file info
virtio/vdev: add embedded device emulation
virtio/vdev: add a new vdev named eth_cvio
docs: add release note for virtio for container

config/common_linuxapp | 5 +
doc/guides/rel_notes/release_2_3.rst | 4 +
drivers/net/virtio/Makefile | 4 +
drivers/net/virtio/vhost.h | 194 +++++++
drivers/net/virtio/vhost_embedded.c | 809 +++++++++++++++++++++++++++++
drivers/net/virtio/virtio_ethdev.c | 329 +++++++++---
drivers/net/virtio/virtio_ethdev.h | 6 +-
drivers/net/virtio/virtio_pci.h | 15 +-
drivers/net/virtio/virtio_rxtx.c | 6 +-
drivers/net/virtio/virtio_rxtx_simple.c | 13 +-
drivers/net/virtio/virtqueue.h | 15 +-
lib/librte_eal/common/eal_common_options.c | 17 +
lib/librte_eal/common/eal_internal_cfg.h | 1 +
lib/librte_eal/common/eal_options.h | 2 +
lib/librte_eal/common/include/rte_memory.h | 16 +
lib/librte_eal/linuxapp/eal/eal.c | 4 +-
lib/librte_eal/linuxapp/eal/eal_memory.c | 88 +++-
17 files changed, 1435 insertions(+), 93 deletions(-)
create mode 100644 drivers/net/virtio/vhost.h
create mode 100644 drivers/net/virtio/vhost_embedded.c
--
2.1.4
Jianfeng Tan
2016-02-05 11:20:24 UTC
Permalink
Originally, there are two cons in using hugepages: a. it needs root
privilege to touch /proc/self/pagemap, which is a prerequisite for
allocating physically contiguous memsegs; b. possibly too many
hugepage files are created, especially when used with 2M hugepages.

Virtual devices don't care about the physical contiguity of the
allocated hugepages at all. The --single-file option provides a way
to allocate all hugepages into a single mem-backed file.

Known issues:
a. The single-file option relies on the kernel to allocate
NUMA-affinitive memory.
b. Possible ABI break: originally, --no-huge used anonymous memory
instead of a file-backed way to create memory.

Signed-off-by: Huawei Xie <***@intel.com>
Signed-off-by: Jianfeng Tan <***@intel.com>
---
lib/librte_eal/common/eal_common_options.c | 17 ++++++++++
lib/librte_eal/common/eal_internal_cfg.h | 1 +
lib/librte_eal/common/eal_options.h | 2 ++
lib/librte_eal/linuxapp/eal/eal.c | 4 +--
lib/librte_eal/linuxapp/eal/eal_memory.c | 50 +++++++++++++++++++++++++-----
5 files changed, 64 insertions(+), 10 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_options.c b/lib/librte_eal/common/eal_common_options.c
index 29942ea..65bccbd 100644
--- a/lib/librte_eal/common/eal_common_options.c
+++ b/lib/librte_eal/common/eal_common_options.c
@@ -95,6 +95,7 @@ eal_long_options[] = {
{OPT_VFIO_INTR, 1, NULL, OPT_VFIO_INTR_NUM },
{OPT_VMWARE_TSC_MAP, 0, NULL, OPT_VMWARE_TSC_MAP_NUM },
{OPT_XEN_DOM0, 0, NULL, OPT_XEN_DOM0_NUM },
+ {OPT_SINGLE_FILE, 0, NULL, OPT_SINGLE_FILE_NUM },
{0, 0, NULL, 0 }
};

@@ -897,6 +898,10 @@ eal_parse_common_option(int opt, const char *optarg,
}
break;

+ case OPT_SINGLE_FILE_NUM:
+ conf->single_file = 1;
+ break;
+
/* don't know what to do, leave this to caller */
default:
return 1;
@@ -956,6 +961,16 @@ eal_check_common_options(struct internal_config *internal_cfg)
"be specified together with --"OPT_NO_HUGE"\n");
return -1;
}
+ if (internal_cfg->single_file && internal_cfg->force_sockets == 1) {
+ RTE_LOG(ERR, EAL, "Option --"OPT_SINGLE_FILE" cannot "
+ "be specified together with --"OPT_SOCKET_MEM"\n");
+ return -1;
+ }
+ if (internal_cfg->single_file && internal_cfg->hugepage_unlink) {
+ RTE_LOG(ERR, EAL, "Option --"OPT_HUGE_UNLINK" cannot "
+ "be specified together with --"OPT_SINGLE_FILE"\n");
+ return -1;
+ }

if (internal_cfg->no_hugetlbfs && internal_cfg->hugepage_unlink) {
RTE_LOG(ERR, EAL, "Option --"OPT_HUGE_UNLINK" cannot "
@@ -994,6 +1009,8 @@ eal_common_usage(void)
" -n CHANNELS Number of memory channels\n"
" -m MB Memory to allocate (see also --"OPT_SOCKET_MEM")\n"
" -r RANKS Force number of memory ranks (don't detect)\n"
+ " --"OPT_SINGLE_FILE" Create just single file for shared memory, and \n"
+ " do not promise physical contiguity of memseg\n"
" -b, --"OPT_PCI_BLACKLIST" Add a PCI device in black list.\n"
" Prevent EAL from using this PCI device. The argument\n"
" format is <domain:bus:devid.func>.\n"
diff --git a/lib/librte_eal/common/eal_internal_cfg.h b/lib/librte_eal/common/eal_internal_cfg.h
index 5f1367e..9117ed9 100644
--- a/lib/librte_eal/common/eal_internal_cfg.h
+++ b/lib/librte_eal/common/eal_internal_cfg.h
@@ -61,6 +61,7 @@ struct hugepage_info {
*/
struct internal_config {
volatile size_t memory; /**< amount of asked memory */
+ volatile unsigned single_file; /**< mmap all hugepages in single file */
volatile unsigned force_nchannel; /**< force number of channels */
volatile unsigned force_nrank; /**< force number of ranks */
volatile unsigned no_hugetlbfs; /**< true to disable hugetlbfs */
diff --git a/lib/librte_eal/common/eal_options.h b/lib/librte_eal/common/eal_options.h
index a881c62..e5da14a 100644
--- a/lib/librte_eal/common/eal_options.h
+++ b/lib/librte_eal/common/eal_options.h
@@ -83,6 +83,8 @@ enum {
OPT_VMWARE_TSC_MAP_NUM,
#define OPT_XEN_DOM0 "xen-dom0"
OPT_XEN_DOM0_NUM,
+#define OPT_SINGLE_FILE "single-file"
+ OPT_SINGLE_FILE_NUM,
OPT_LONG_MAX_NUM
};

diff --git a/lib/librte_eal/linuxapp/eal/eal.c b/lib/librte_eal/linuxapp/eal/eal.c
index 635ec36..2bc84f7 100644
--- a/lib/librte_eal/linuxapp/eal/eal.c
+++ b/lib/librte_eal/linuxapp/eal/eal.c
@@ -790,6 +790,8 @@ rte_eal_init(int argc, char **argv)
rte_panic("Cannot init IVSHMEM\n");
#endif

+ eal_thread_init_master(rte_config.master_lcore);
+
if (rte_eal_memory_init() < 0)
rte_panic("Cannot init memory\n");

@@ -823,8 +825,6 @@ rte_eal_init(int argc, char **argv)
if (eal_plugins_init() < 0)
rte_panic("Cannot init plugins\n");

- eal_thread_init_master(rte_config.master_lcore);
-
ret = eal_thread_dump_affinity(cpuset, RTE_CPU_AFFINITY_STR_LEN);

RTE_LOG(DEBUG, EAL, "Master lcore %u is ready (tid=%x;cpuset=[%s%s])\n",
diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index 6008533..68ef49a 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -1102,20 +1102,54 @@ rte_eal_hugepage_init(void)
/* get pointer to global configuration */
mcfg = rte_eal_get_configuration()->mem_config;

- /* hugetlbfs can be disabled */
- if (internal_config.no_hugetlbfs) {
- addr = mmap(NULL, internal_config.memory, PROT_READ | PROT_WRITE,
- MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+ /* when hugetlbfs is disabled or single-file option is specified */
+ if (internal_config.no_hugetlbfs || internal_config.single_file) {
+ int fd;
+ uint64_t pagesize;
+ unsigned socket_id = rte_socket_id();
+ char filepath[MAX_HUGEPAGE_PATH];
+
+ if (internal_config.no_hugetlbfs) {
+ eal_get_hugefile_path(filepath, sizeof(filepath),
+ "/dev/shm", 0);
+ pagesize = RTE_PGSIZE_4K;
+ } else {
+ struct hugepage_info *hpi;
+
+ hpi = &internal_config.hugepage_info[0];
+ eal_get_hugefile_path(filepath, sizeof(filepath),
+ hpi->hugedir, 0);
+ pagesize = hpi->hugepage_sz;
+ }
+ fd = open(filepath, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
+ if (fd < 0) {
+ RTE_LOG(ERR, EAL, "%s: open %s failed: %s\n",
+ __func__, filepath, strerror(errno));
+ return -1;
+ }
+
+ if (ftruncate(fd, internal_config.memory) < 0) {
+ RTE_LOG(ERR, EAL, "ftruncate %s failed: %s\n",
+ filepath, strerror(errno));
+ return -1;
+ }
+
+ addr = mmap(NULL, internal_config.memory,
+ PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_POPULATE, fd, 0);
if (addr == MAP_FAILED) {
- RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n", __func__,
- strerror(errno));
+ RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n",
+ __func__, strerror(errno));
return -1;
}
mcfg->memseg[0].phys_addr = (phys_addr_t)(uintptr_t)addr;
mcfg->memseg[0].addr = addr;
- mcfg->memseg[0].hugepage_sz = RTE_PGSIZE_4K;
+ mcfg->memseg[0].hugepage_sz = pagesize;
mcfg->memseg[0].len = internal_config.memory;
- mcfg->memseg[0].socket_id = 0;
+ mcfg->memseg[0].socket_id = socket_id;
+
+ close(fd);
+
return 0;
}
--
2.1.4
Yuanhan Liu
2016-03-07 13:13:22 UTC
Permalink
CC'ed the EAL hugepage maintainer, which is something you should do
when sending a patch.
Post by Jianfeng Tan
Originally, there're two cons in using hugepage: a. needs root
privilege to touch /proc/self/pagemap, which is a premise to
alllocate physically contiguous memseg; b. possibly too many
hugepage file are created, especially used with 2M hugepage.
For virtual devices, they don't care about physical-contiguity
of allocated hugepages at all. Option --single-file is to
provide a way to allocate all hugepages into single mem-backed
file.
a. single-file option relys on kernel to allocate numa-affinitive
memory.
b. possible ABI break, originally, --no-huge uses anonymous memory
instead of file-backed way to create memory.
...
Post by Jianfeng Tan
@@ -956,6 +961,16 @@ eal_check_common_options(struct internal_config *internal_cfg)
"be specified together with --"OPT_NO_HUGE"\n");
return -1;
}
+ if (internal_cfg->single_file && internal_cfg->force_sockets == 1) {
+ RTE_LOG(ERR, EAL, "Option --"OPT_SINGLE_FILE" cannot "
+ "be specified together with --"OPT_SOCKET_MEM"\n");
+ return -1;
+ }
+ if (internal_cfg->single_file && internal_cfg->hugepage_unlink) {
+ RTE_LOG(ERR, EAL, "Option --"OPT_HUGE_UNLINK" cannot "
+ "be specified together with --"OPT_SINGLE_FILE"\n");
+ return -1;
+ }
These two limitations don't make sense to me.
Post by Jianfeng Tan
diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index 6008533..68ef49a 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -1102,20 +1102,54 @@ rte_eal_hugepage_init(void)
/* get pointer to global configuration */
mcfg = rte_eal_get_configuration()->mem_config;
- /* hugetlbfs can be disabled */
- if (internal_config.no_hugetlbfs) {
- addr = mmap(NULL, internal_config.memory, PROT_READ | PROT_WRITE,
- MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+ /* when hugetlbfs is disabled or single-file option is specified */
+ if (internal_config.no_hugetlbfs || internal_config.single_file) {
+ int fd;
+ uint64_t pagesize;
+ unsigned socket_id = rte_socket_id();
+ char filepath[MAX_HUGEPAGE_PATH];
+
+ if (internal_config.no_hugetlbfs) {
+ eal_get_hugefile_path(filepath, sizeof(filepath),
+ "/dev/shm", 0);
+ pagesize = RTE_PGSIZE_4K;
+ } else {
+ struct hugepage_info *hpi;
+
+ hpi = &internal_config.hugepage_info[0];
+ eal_get_hugefile_path(filepath, sizeof(filepath),
+ hpi->hugedir, 0);
+ pagesize = hpi->hugepage_sz;
+ }
+ fd = open(filepath, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
+ if (fd < 0) {
+ RTE_LOG(ERR, EAL, "%s: open %s failed: %s\n",
+ __func__, filepath, strerror(errno));
+ return -1;
+ }
+
+ if (ftruncate(fd, internal_config.memory) < 0) {
+ RTE_LOG(ERR, EAL, "ftuncate %s failed: %s\n",
+ filepath, strerror(errno));
+ return -1;
+ }
+
+ addr = mmap(NULL, internal_config.memory,
+ PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_POPULATE, fd, 0);
if (addr == MAP_FAILED) {
- RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n", __func__,
- strerror(errno));
+ RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n",
+ __func__, strerror(errno));
return -1;
}
mcfg->memseg[0].phys_addr = (phys_addr_t)(uintptr_t)addr;
mcfg->memseg[0].addr = addr;
- mcfg->memseg[0].hugepage_sz = RTE_PGSIZE_4K;
+ mcfg->memseg[0].hugepage_sz = pagesize;
mcfg->memseg[0].len = internal_config.memory;
- mcfg->memseg[0].socket_id = 0;
+ mcfg->memseg[0].socket_id = socket_id;
I see quite a few issues:

- Assume I have a system with two hugepage sizes: 1G (x4) and 2M (x512),
mounted at /dev/hugepages and /mnt, respectively.

Here we then get a 5G internal_config.memory, and your code will
try to mmap 5G on the first mount point (/dev/hugepages) due to the
hardcoded logic in your code:

hpi = &internal_config.hugepage_info[0];
eal_get_hugefile_path(filepath, sizeof(filepath),
hpi->hugedir, 0);

But that mount point has 4G in total, so it will fail.

- As you stated, socket_id is hardcoded, which could be wrong.

- As stated above, the option limitations don't seem right to me.

I mean, --single-file should be able to work with the --socket-mem
option semantically.


And I have been thinking about how to deal with those issues properly,
and a __very immature__ solution came to my mind (which may simply not
work), but anyway, here it is FYI: we go through the same process used
to handle normal huge page initialization for the --single-file option
as well, but take different actions, or no action at all, at some
stages when that option is given. This is a bit similar to the way
RTE_EAL_SINGLE_FILE_SEGMENTS is handled.

And we create one hugepage file for each node and each page size. For
a system like mine above (2 nodes), it may populate files like the
following:

- 1G x 2 on node0
- 1G x 2 on node1
- 2M x 256 on node0
- 2M x 256 on node1

That could normally fit your case, though 4 nodes looks like the
maximum node number; the --socket-mem option may relieve the limit a
bit.

And if we "could" avoid caring about socket_id being set correctly, we
could simply allocate one file for each hugepage size. That would work
well for your container enabling.
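
A rough sketch of that layout, reusing the existing internal_config
fields (illustrative only, not a worked-out patch):

/* One backing file per (hugepage size, NUMA node) pair. */
unsigned i, node;
char path[PATH_MAX];

for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
	struct hugepage_info *hpi = &internal_config.hugepage_info[i];

	for (node = 0; node < RTE_MAX_NUMA_NODES; node++) {
		snprintf(path, sizeof(path), "%s/map_sz%u_node%u",
			 hpi->hugedir,
			 (unsigned)(hpi->hugepage_sz >> 20), node);
		/* open() + ftruncate() + mmap() one file covering all
		 * the memory requested from this node at this page
		 * size; bind it with mbind() if affinity matters. */
	}
}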

BTW, since we already have the SINGLE_FILE_SEGMENTS (config) option,
adding another option --single-file looks really confusing to me.

To me, maybe you could base this on the SINGLE_FILE_SEGMENTS option
and add another option, say --no-sort (I confess this name sucks, but
you get my point). With that, we could make sure to create as few
huge page files as possible, to fit your case.

--yliu
Tan, Jianfeng
2016-03-08 01:55:10 UTC
Permalink
Hi Yuanhan,
Post by Yuanhan Liu
CC'ed EAL hugepage maintainer, which is something you should do when
send a patch.
Thanks for doing this.
Post by Yuanhan Liu
Post by Jianfeng Tan
Originally, there're two cons in using hugepage: a. needs root
privilege to touch /proc/self/pagemap, which is a premise to
alllocate physically contiguous memseg; b. possibly too many
hugepage file are created, especially used with 2M hugepage.
For virtual devices, they don't care about physical-contiguity
of allocated hugepages at all. Option --single-file is to
provide a way to allocate all hugepages into single mem-backed
file.
a. single-file option relys on kernel to allocate numa-affinitive
memory.
b. possible ABI break, originally, --no-huge uses anonymous memory
instead of file-backed way to create memory.
...
Post by Jianfeng Tan
@@ -956,6 +961,16 @@ eal_check_common_options(struct internal_config *internal_cfg)
"be specified together with --"OPT_NO_HUGE"\n");
return -1;
}
+ if (internal_cfg->single_file && internal_cfg->force_sockets == 1) {
+ RTE_LOG(ERR, EAL, "Option --"OPT_SINGLE_FILE" cannot "
+ "be specified together with --"OPT_SOCKET_MEM"\n");
+ return -1;
+ }
+ if (internal_cfg->single_file && internal_cfg->hugepage_unlink) {
+ RTE_LOG(ERR, EAL, "Option --"OPT_HUGE_UNLINK" cannot "
+ "be specified together with --"OPT_SINGLE_FILE"\n");
+ return -1;
+ }
The two limitation doesn't make sense to me.
For the force_sockets option, my original thought on the --single-file
option was that we don't sort those pages (sorting requires
root/cap_sys_admin) and don't even look up NUMA information, because
the file may contain both sockets' memory.

For the hugepage_unlink option: those hugepage files get closed at the
end of memory initialization, so if we also unlink those hugepage
files, we cannot share them with other processes (say, the backend).
Post by Yuanhan Liu
Post by Jianfeng Tan
diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index 6008533..68ef49a 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -1102,20 +1102,54 @@ rte_eal_hugepage_init(void)
/* get pointer to global configuration */
mcfg = rte_eal_get_configuration()->mem_config;
- /* hugetlbfs can be disabled */
- if (internal_config.no_hugetlbfs) {
- addr = mmap(NULL, internal_config.memory, PROT_READ | PROT_WRITE,
- MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+ /* when hugetlbfs is disabled or single-file option is specified */
+ if (internal_config.no_hugetlbfs || internal_config.single_file) {
+ int fd;
+ uint64_t pagesize;
+ unsigned socket_id = rte_socket_id();
+ char filepath[MAX_HUGEPAGE_PATH];
+
+ if (internal_config.no_hugetlbfs) {
+ eal_get_hugefile_path(filepath, sizeof(filepath),
+ "/dev/shm", 0);
+ pagesize = RTE_PGSIZE_4K;
+ } else {
+ struct hugepage_info *hpi;
+
+ hpi = &internal_config.hugepage_info[0];
+ eal_get_hugefile_path(filepath, sizeof(filepath),
+ hpi->hugedir, 0);
+ pagesize = hpi->hugepage_sz;
+ }
+ fd = open(filepath, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
+ if (fd < 0) {
+ RTE_LOG(ERR, EAL, "%s: open %s failed: %s\n",
+ __func__, filepath, strerror(errno));
+ return -1;
+ }
+
+ if (ftruncate(fd, internal_config.memory) < 0) {
+ RTE_LOG(ERR, EAL, "ftuncate %s failed: %s\n",
+ filepath, strerror(errno));
+ return -1;
+ }
+
+ addr = mmap(NULL, internal_config.memory,
+ PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_POPULATE, fd, 0);
if (addr == MAP_FAILED) {
- RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n", __func__,
- strerror(errno));
+ RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n",
+ __func__, strerror(errno));
return -1;
}
mcfg->memseg[0].phys_addr = (phys_addr_t)(uintptr_t)addr;
mcfg->memseg[0].addr = addr;
- mcfg->memseg[0].hugepage_sz = RTE_PGSIZE_4K;
+ mcfg->memseg[0].hugepage_sz = pagesize;
mcfg->memseg[0].len = internal_config.memory;
- mcfg->memseg[0].socket_id = 0;
+ mcfg->memseg[0].socket_id = socket_id;
- Assume I have a system with two hugepage sizes: 1G (x4) and 2M (x512),
mounted at /dev/hugepages and /mnt, respectively.
Here we then got an 5G internal_config.memory, and your code will
try to mmap 5G on the first mount point (/dev/hugepages) due to the
hpi = &internal_config.hugepage_info[0];
eal_get_hugefile_path(filepath, sizeof(filepath),
hpi->hugedir, 0);
But it has 4G in total, therefore, it will fails.
As mentioned above, this case is not covered by the original design of --single-file.
Post by Yuanhan Liu
- As you stated, socket_id is hardcoded, which could be wrong.
We rely on the OS to allocate hugepages, and cannot promise that the
physical hugepages in the big hugepage file come from the same socket.
Post by Yuanhan Liu
- As stated in above, the option limitation doesn't seem right to me.
I mean, --single-file should be able to work with --socket-mem option
in semantic.
If we'd like to work well with the --socket-mem option, we need to
use syscalls like set_mempolicy() and mbind(); see the sketch below.
So it would bring a bigger change compared to the current one; I
don't know if that's acceptable?
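
For reference, a minimal sketch of what that would look like
(illustrative; error handling omitted):

#include <numaif.h>	/* mbind(), MPOL_BIND; link with -lnuma */

/* Bind a freshly mmap'ed region to one NUMA node before touching it,
 * so the kernel allocates its pages from that node. */
static int
bind_to_node(void *addr, size_t len, unsigned node)
{
	unsigned long nodemask = 1UL << node;

	return mbind(addr, len, MPOL_BIND, &nodemask,
		     sizeof(nodemask) * 8, 0);
}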
Post by Yuanhan Liu
And I have been thinking how to deal with those issues properly, and a
__very immature__ solution come to my mind (which could be simply not
working), but anyway, here is FYI: we go through the same process to
handle normal huge page initilization to --single-file option as well.
But we take different actions or no actions at all at some stages when
that option is given, which is a bit similiar with the way of handling
RTE_EAL_SINGLE_FILE_SEGMENTS.
And we create one hugepage file for each node, each page size. For a
- 1G x 2 on node0
- 1G x 2 on node1
- 2M x 256 on node0
- 2M x 256 on node1
That could normally fit your case. Though 4 nodes looks like the maximum
node number, --socket-mem option may relieve the limit a bit.
And if we "could" not care the socket_id being set correctly, we could
just simply allocate one file for each hugepage size. That would work
well for your container enabling.
This new way seems a good option at first sight. Let's compare it with
the original design.

The original design just covers the simplest scenario:
a. just one hugetlbfs (the new way can support multiple hugetlbfs
mounts)
b. does not require root privilege (the new way can achieve this by
using the above-mentioned mbind() or set_mempolicy() syscalls)
c. no sorting (both ways are OK)
d. performance: from the perspective of virtio for containers, we care
most about the performance of address translation in vhost. Vhost
currently uses an O(n) linear comparison to translate an address (this
could be optimized to O(log n) using a segment tree, or even better
using a cache; sorry, that's another problem), so we should maintain
as few files as possible; see the lookup sketch after this list. (The
new way can achieve this when used with --socket-mem, --huge-dir.)
e. NUMA awareness is not required (and it's complex). (The new way can
solve this, though without guarantees.)
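
A rough sketch of that linear translation, to make the cost concrete
(field names are illustrative, not the actual vhost structures):

struct mem_region {
	uint64_t guest_addr;	/* address as seen by the frontend */
	uint64_t host_addr;	/* address in vhost's address space */
	uint64_t size;
};

static uint64_t
guest_to_vhost(const struct mem_region *r, int nregions, uint64_t addr)
{
	int i;

	/* O(n) scan: the fewer regions (files) shared, the cheaper
	 * every single address translation becomes. */
	for (i = 0; i < nregions; i++, r++) {
		if (addr >= r->guest_addr && addr < r->guest_addr + r->size)
			return r->host_addr + (addr - r->guest_addr);
	}
	return 0; /* not found */
}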

All in all, this new way seems great to me.

Another thing: if we "go through the same process to handle normal
huge page initialization", my consideration is that
RTE_EAL_SINGLE_FILE_SEGMENTS goes that way to maximize code reuse, but
the new way has little code in common with the original paths, and
mixing these options together leads to bad readability. What do you
think?
Post by Yuanhan Liu
BTW, since we already have SINGLE_FILE_SEGMENTS (config) option, adding
another option --single-file looks really confusing to me.
To me, maybe you could base it on the SINGLE_FILE_SEGMENTS option, and add
another option, say --no-sort (I confess this name sucks, but you get
my point). With that, we could make sure to create as few huge page
files as possible, to fit your case.
This is great advice. So what do you think of --converged, or
--no-scattered-mem, or any better idea?

Thanks for the valuable input.

Jianfeng
Post by Yuanhan Liu
--yliu
Yuanhan Liu
2016-03-08 02:44:37 UTC
Permalink
Post by Tan, Jianfeng
Hi Yuanhan,
Post by Yuanhan Liu
CC'ed the EAL hugepage maintainer, which is something you should do when
sending a patch.
Thanks for doing this.
Post by Yuanhan Liu
Post by Jianfeng Tan
Originally, there are two cons in using hugepages: a. it needs root
privilege to touch /proc/self/pagemap, which is a prerequisite to
allocate physically contiguous memsegs; b. possibly too many
hugepage files are created, especially when used with 2M hugepages.
Virtual devices don't care about the physical contiguity
of allocated hugepages at all. The --single-file option is to
provide a way to allocate all hugepages into a single mem-backed
file.
a. the single-file option relies on the kernel to allocate NUMA-affinitive
memory.
b. possible ABI break: originally, --no-huge uses anonymous memory
instead of a file-backed way to create memory.
...
Post by Jianfeng Tan
@@ -956,6 +961,16 @@ eal_check_common_options(struct internal_config *internal_cfg)
"be specified together with --"OPT_NO_HUGE"\n");
return -1;
}
+ if (internal_cfg->single_file && internal_cfg->force_sockets == 1) {
+ RTE_LOG(ERR, EAL, "Option --"OPT_SINGLE_FILE" cannot "
+ "be specified together with --"OPT_SOCKET_MEM"\n");
+ return -1;
+ }
+ if (internal_cfg->single_file && internal_cfg->hugepage_unlink) {
+ RTE_LOG(ERR, EAL, "Option --"OPT_HUGE_UNLINK" cannot "
+ "be specified together with --"OPT_SINGLE_FILE"\n");
+ return -1;
+ }
The two limitations don't make sense to me.
For the force_sockets option, my original thought on the --single-file option
is that we don't sort those pages (sorting requires root/cap_sys_admin) and
don't even look up NUMA information, because the file may contain both
sockets' memory.
For the hugepage_unlink option, those hugepage files get closed at the end
of memory initialization; if we also unlink them, we cannot share them
with other processes (say, the backend).
Yeah, I know where the two limitations come from: your implementation. I
was just wondering if they both are __truly__ limitations. I mean,
can we get rid of them somehow?

For the --socket-mem option, if we can't handle it well, or if we could
ignore the socket_id for allocated huge pages, yes, the limitation is
a true one.

But for the second option, no: we should be able to make it work
well. One extra action is that you should not invoke "close(fd)" for those
huge page files. And then you can get all the information, as I stated
in a reply to your 2nd patch.
Post by Tan, Jianfeng
Post by Yuanhan Liu
Post by Jianfeng Tan
diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index 6008533..68ef49a 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -1102,20 +1102,54 @@ rte_eal_hugepage_init(void)
/* get pointer to global configuration */
mcfg = rte_eal_get_configuration()->mem_config;
- /* hugetlbfs can be disabled */
- if (internal_config.no_hugetlbfs) {
- addr = mmap(NULL, internal_config.memory, PROT_READ | PROT_WRITE,
- MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+ /* when hugetlbfs is disabled or single-file option is specified */
+ if (internal_config.no_hugetlbfs || internal_config.single_file) {
+ int fd;
+ uint64_t pagesize;
+ unsigned socket_id = rte_socket_id();
+ char filepath[MAX_HUGEPAGE_PATH];
+
+ if (internal_config.no_hugetlbfs) {
+ eal_get_hugefile_path(filepath, sizeof(filepath),
+ "/dev/shm", 0);
+ pagesize = RTE_PGSIZE_4K;
+ } else {
+ struct hugepage_info *hpi;
+
+ hpi = &internal_config.hugepage_info[0];
+ eal_get_hugefile_path(filepath, sizeof(filepath),
+ hpi->hugedir, 0);
+ pagesize = hpi->hugepage_sz;
+ }
+ fd = open(filepath, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
+ if (fd < 0) {
+ RTE_LOG(ERR, EAL, "%s: open %s failed: %s\n",
+ __func__, filepath, strerror(errno));
+ return -1;
+ }
+
+ if (ftruncate(fd, internal_config.memory) < 0) {
+ RTE_LOG(ERR, EAL, "ftuncate %s failed: %s\n",
+ filepath, strerror(errno));
+ return -1;
+ }
+
+ addr = mmap(NULL, internal_config.memory,
+ PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_POPULATE, fd, 0);
if (addr == MAP_FAILED) {
- RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n", __func__,
- strerror(errno));
+ RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n",
+ __func__, strerror(errno));
return -1;
}
mcfg->memseg[0].phys_addr = (phys_addr_t)(uintptr_t)addr;
mcfg->memseg[0].addr = addr;
- mcfg->memseg[0].hugepage_sz = RTE_PGSIZE_4K;
+ mcfg->memseg[0].hugepage_sz = pagesize;
mcfg->memseg[0].len = internal_config.memory;
- mcfg->memseg[0].socket_id = 0;
+ mcfg->memseg[0].socket_id = socket_id;
- Assume I have a system with two hugepage sizes: 1G (x4) and 2M (x512),
mounted at /dev/hugepages and /mnt, respectively.
Here we then get a 5G internal_config.memory, and your code will
try to mmap 5G on the first mount point (/dev/hugepages) due to the
hpi = &internal_config.hugepage_info[0];
eal_get_hugefile_path(filepath, sizeof(filepath),
hpi->hugedir, 0);
But it has only 4G in total; therefore, it will fail.
As mentioned above, this case was not part of the original design of --single-file.
But it's such a common case, isn't it?
Post by Tan, Jianfeng
Post by Yuanhan Liu
- As you stated, socket_id is hardcoded, which could be wrong.
We rely on the OS to allocate hugepages, and cannot promise that the physical
hugepages in the big hugepage file are from the same socket.
Post by Yuanhan Liu
- As stated above, the option limitation doesn't seem right to me.
I mean, --single-file should be able to work with the --socket-mem option
semantically.
If we'd like to work well with the --socket-mem option, we need to use syscalls
like set_mempolicy() and mbind(), so it would be a bigger change than
the current one. I don't know if that's acceptable?
Yes, if that's the right way to go. But also, as you stated, I doubt we
really need to handle NUMA affinity here, as it's complex.
Post by Tan, Jianfeng
Post by Yuanhan Liu
And I have been thinking how to deal with those issues properly, and a
__very immature__ solution came to my mind (which could simply not be
working), but anyway, here it is, FYI: we apply the same process used to
handle normal huge page initialization to the --single-file option as well,
but we take different actions, or no actions at all, at some stages when
that option is given, which is a bit similar to the way of handling
RTE_EAL_SINGLE_FILE_SEGMENTS.
And we create one hugepage file for each node and each page size. For a
two-node system with both 1G and 2M hugepages, that gives:
- 1G x 2 on node0
- 1G x 2 on node1
- 2M x 256 on node0
- 2M x 256 on node1
That could normally fit your case. Though 4 looks like the maximum
node count, the --socket-mem option may relieve the limit a bit.
And if we "could" avoid caring whether socket_id is set correctly, we could
simply allocate one file for each hugepage size. That would work
well for your container enabling.
This way seems like a good option at first sight. Let's compare this new way
with the original design.
a. just one hugetlbfs (the new way can support multiple hugetlbfs
mounts)
b. does not require root privilege (the new way can achieve this by using
the above-mentioned mbind() or set_mempolicy() syscalls)
c. no sorting (both ways are OK)
d. performance: from the perspective of virtio for containers, the main
concern is the performance of address translation in the vhost. In
the vhost, we currently adopt an O(n) linear comparison to translate an
address (this can be optimized to O(log n) using a segment tree, or even
better using a cache, but that's another problem), so we should maintain as
few files as possible. (the new way can achieve this when used with
--socket-mem, --huge-dir)
e. NUMA awareness is not required (and it's complex). (the new way can solve
this, though without guarantees)
All in all, this new way seems great to me.
Another thing: if we "go through the same process to handle normal huge page
initialization", my concern is that RTE_EAL_SINGLE_FILE_SEGMENTS goes that
way to maximize code reuse, but the new way has little code in common with
the original paths, and mixing these options together leads to bad readability.
What do you think?
Indeed. I've already found that the code is a bit hard to read, due to the
many "#ifdef ... #else ... #endif" blocks, for RTE_EAL_SINGLE_FILE_SEGMENTS
as well as some special archs.

Therefore, I would suggest doing it as below: add another option based
on the SINGLE_FILE_SEGMENTS implementation.

I mean, SINGLE_FILE_SEGMENTS already tries to generate as few files as
possible. If we add another option, say --no-sort (or --no-phys-continuity),
we could add just a few lines of code to let it generate one file for
each huge page size (if we don't consider NUMA affinity).
Post by Tan, Jianfeng
Post by Yuanhan Liu
BTW, since we already have SINGLE_FILE_SEGMENTS (config) option, adding
another option --single-file looks really confusing to me.
To me, maybe you could base it on the SINGLE_FILE_SEGMENTS option, and add
another option, say --no-sort (I confess this name sucks, but you get
my point). With that, we could make sure to create as few huge page
files as possible, to fit your case.
This is great advice. So what do you think of --converged, or
--no-scattered-mem, or any better idea?
TBH, none of them looks great to me, either. But I have no better
options. Well, --no-phys-continuity looks like the best option to
me so far :)

--yliu
Post by Tan, Jianfeng
Thanks for the valuable input.
Jianfeng
Post by Yuanhan Liu
--yliu
Tan, Jianfeng
2016-03-09 14:44:01 UTC
Permalink
Hi,
Post by Yuanhan Liu
Post by Tan, Jianfeng
Hi Yuanhan,
Post by Yuanhan Liu
CC'ed the EAL hugepage maintainer, which is something you should do when
sending a patch.
Thanks for doing this.
Post by Yuanhan Liu
Post by Jianfeng Tan
Originally, there are two cons in using hugepages: a. it needs root
privilege to touch /proc/self/pagemap, which is a prerequisite to
allocate physically contiguous memsegs; b. possibly too many
hugepage files are created, especially when used with 2M hugepages.
Virtual devices don't care about the physical contiguity
of allocated hugepages at all. The --single-file option is to
provide a way to allocate all hugepages into a single mem-backed
file.
a. the single-file option relies on the kernel to allocate NUMA-affinitive
memory.
b. possible ABI break: originally, --no-huge uses anonymous memory
instead of a file-backed way to create memory.
...
Post by Jianfeng Tan
@@ -956,6 +961,16 @@ eal_check_common_options(struct internal_config *internal_cfg)
"be specified together with --"OPT_NO_HUGE"\n");
return -1;
}
+ if (internal_cfg->single_file && internal_cfg->force_sockets == 1) {
+ RTE_LOG(ERR, EAL, "Option --"OPT_SINGLE_FILE" cannot "
+ "be specified together with --"OPT_SOCKET_MEM"\n");
+ return -1;
+ }
+ if (internal_cfg->single_file && internal_cfg->hugepage_unlink) {
+ RTE_LOG(ERR, EAL, "Option --"OPT_HUGE_UNLINK" cannot "
+ "be specified together with --"OPT_SINGLE_FILE"\n");
+ return -1;
+ }
The two limitations don't make sense to me.
For the force_sockets option, my original thought on the --single-file option
is that we don't sort those pages (sorting requires root/cap_sys_admin) and
don't even look up NUMA information, because the file may contain both
sockets' memory.
For the hugepage_unlink option, those hugepage files get closed at the end
of memory initialization; if we also unlink them, we cannot share them
with other processes (say, the backend).
Yeah, I know where the two limitations come from: your implementation. I
was just wondering if they both are __truly__ limitations. I mean,
can we get rid of them somehow?
For the --socket-mem option, if we can't handle it well, or if we could
ignore the socket_id for allocated huge pages, yes, the limitation is
a true one.
To make it work with the --socket-mem option, we need to call
mbind()/set_mempolicy(), which makes "LDFLAGS += -lnuma" a
mandatory line in the mk file. I don't know if it's acceptable to bring in
a dependency on libnuma.so?
Post by Yuanhan Liu
But for the second option, no: we should be able to make it work
well. One extra action is that you should not invoke "close(fd)" for those
huge page files. And then you can get all the information, as I stated
in a reply to your 2nd patch.
As discussed yesterday, I think there's an open-file limit for each
process; if we keep those FDs open, it could cause failures in existing
programs. Do others consider this a problem?
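(For context, the per-process limit in question is RLIMIT_NOFILE; it can
be inspected with the standard getrlimit() call, e.g.:)

#include <stdio.h>
#include <sys/resource.h>

int
main(void)
{
	struct rlimit rl;

	/* RLIMIT_NOFILE caps the number of FDs a process may keep open;
	 * one FD per 2M hugepage file could approach this limit quickly. */
	if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
		printf("nofile limit: soft %lu, hard %lu\n",
		       (unsigned long)rl.rlim_cur,
		       (unsigned long)rl.rlim_max);
	return 0;
}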
...
Post by Yuanhan Liu
Post by Tan, Jianfeng
Post by Yuanhan Liu
BTW, since we already have SINGLE_FILE_SEGMENTS (config) option, adding
another option --single-file looks really confusing to me.
To me, maybe you could base it on the SINGLE_FILE_SEGMENTS option, and add
another option, say --no-sort (I confess this name sucks, but you get
my point). With that, we could make sure to create as few huge page
files as possible, to fit your case.
This is great advice. So what do you think of --converged, or
--no-scattered-mem, or any better idea?
TBH, none of them looks great to me, either. But I have no better
options. Well, --no-phys-continuity looks like the best option to
me so far :)
I'd like to make it a little more concise; how about --no-phys-contig?
In addition, Yuanhan thinks the name still doesn't literally convey
creating just one file for each hugetlbfs (or socket). But from my side,
there's an indirect implication: if there's no need to promise physical
contiguity, then there's no need to create hugepages one by one. Can anyone
give an opinion here? Thanks.

Thanks,
Jianfeng
Panu Matilainen
2016-03-08 08:49:30 UTC
Permalink
Post by Yuanhan Liu
CC'ed the EAL hugepage maintainer, which is something you should do when
sending a patch.
Post by Jianfeng Tan
Originally, there are two cons in using hugepages: a. it needs root
privilege to touch /proc/self/pagemap, which is a prerequisite to
allocate physically contiguous memsegs; b. possibly too many
hugepage files are created, especially when used with 2M hugepages.
Virtual devices don't care about the physical contiguity
of allocated hugepages at all. The --single-file option is to
provide a way to allocate all hugepages into a single mem-backed
file.
a. the single-file option relies on the kernel to allocate NUMA-affinitive
memory.
b. possible ABI break: originally, --no-huge uses anonymous memory
instead of a file-backed way to create memory.
...
Post by Jianfeng Tan
@@ -956,6 +961,16 @@ eal_check_common_options(struct internal_config *internal_cfg)
"be specified together with --"OPT_NO_HUGE"\n");
return -1;
}
+ if (internal_cfg->single_file && internal_cfg->force_sockets == 1) {
+ RTE_LOG(ERR, EAL, "Option --"OPT_SINGLE_FILE" cannot "
+ "be specified together with --"OPT_SOCKET_MEM"\n");
+ return -1;
+ }
+ if (internal_cfg->single_file && internal_cfg->hugepage_unlink) {
+ RTE_LOG(ERR, EAL, "Option --"OPT_HUGE_UNLINK" cannot "
+ "be specified together with --"OPT_SINGLE_FILE"\n");
+ return -1;
+ }
The two limitations don't make sense to me.
Post by Jianfeng Tan
diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index 6008533..68ef49a 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -1102,20 +1102,54 @@ rte_eal_hugepage_init(void)
/* get pointer to global configuration */
mcfg = rte_eal_get_configuration()->mem_config;
- /* hugetlbfs can be disabled */
- if (internal_config.no_hugetlbfs) {
- addr = mmap(NULL, internal_config.memory, PROT_READ | PROT_WRITE,
- MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+ /* when hugetlbfs is disabled or single-file option is specified */
+ if (internal_config.no_hugetlbfs || internal_config.single_file) {
+ int fd;
+ uint64_t pagesize;
+ unsigned socket_id = rte_socket_id();
+ char filepath[MAX_HUGEPAGE_PATH];
+
+ if (internal_config.no_hugetlbfs) {
+ eal_get_hugefile_path(filepath, sizeof(filepath),
+ "/dev/shm", 0);
+ pagesize = RTE_PGSIZE_4K;
+ } else {
+ struct hugepage_info *hpi;
+
+ hpi = &internal_config.hugepage_info[0];
+ eal_get_hugefile_path(filepath, sizeof(filepath),
+ hpi->hugedir, 0);
+ pagesize = hpi->hugepage_sz;
+ }
+ fd = open(filepath, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
+ if (fd < 0) {
+ RTE_LOG(ERR, EAL, "%s: open %s failed: %s\n",
+ __func__, filepath, strerror(errno));
+ return -1;
+ }
+
+ if (ftruncate(fd, internal_config.memory) < 0) {
+ RTE_LOG(ERR, EAL, "ftuncate %s failed: %s\n",
+ filepath, strerror(errno));
+ return -1;
+ }
+
+ addr = mmap(NULL, internal_config.memory,
+ PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_POPULATE, fd, 0);
if (addr == MAP_FAILED) {
- RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n", __func__,
- strerror(errno));
+ RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n",
+ __func__, strerror(errno));
return -1;
}
mcfg->memseg[0].phys_addr = (phys_addr_t)(uintptr_t)addr;
mcfg->memseg[0].addr = addr;
- mcfg->memseg[0].hugepage_sz = RTE_PGSIZE_4K;
+ mcfg->memseg[0].hugepage_sz = pagesize;
mcfg->memseg[0].len = internal_config.memory;
- mcfg->memseg[0].socket_id = 0;
+ mcfg->memseg[0].socket_id = socket_id;
- Assume I have a system with two hugepage sizes: 1G (x4) and 2M (x512),
mounted at /dev/hugepages and /mnt, respectively.
Here we then get a 5G internal_config.memory, and your code will
try to mmap 5G on the first mount point (/dev/hugepages) due to the
hpi = &internal_config.hugepage_info[0];
eal_get_hugefile_path(filepath, sizeof(filepath),
hpi->hugedir, 0);
But it has only 4G in total; therefore, it will fail.
- As you stated, socket_id is hardcoded, which could be wrong.
- As stated above, the option limitation doesn't seem right to me.
I mean, --single-file should be able to work with the --socket-mem option
semantically.
And I have been thinking how to deal with those issues properly, and a
__very immature__ solution came to my mind (which could simply not be
working), but anyway, here it is, FYI: we apply the same process used to
handle normal huge page initialization to the --single-file option as well,
but we take different actions, or no actions at all, at some stages when
that option is given, which is a bit similar to the way of handling
RTE_EAL_SINGLE_FILE_SEGMENTS.
And we create one hugepage file for each node and each page size. For a
two-node system with both 1G and 2M hugepages, that gives:
- 1G x 2 on node0
- 1G x 2 on node1
- 2M x 256 on node0
- 2M x 256 on node1
That could normally fit your case. Though 4 looks like the maximum
node count, the --socket-mem option may relieve the limit a bit.
And if we "could" avoid caring whether socket_id is set correctly, we could
simply allocate one file for each hugepage size. That would work
well for your container enabling.
BTW, since we already have SINGLE_FILE_SEGMENTS (config) option, adding
another option --single-file looks really confusing to me.
To me, maybe you could base it on the SINGLE_FILE_SEGMENTS option, and add
another option, say --no-sort (I confess this name sucks, but you get
my point). With that, we could make sure to create as few huge page
files as possible, to fit your case.
Note that SINGLE_FILE_SEGMENTS is a nasty hack that only the IVSHMEM
config uses; getting rid of it (by replacing it with a runtime switch)
would be great. OTOH IVSHMEM itself seems to have fallen out of fashion,
since the memnic driver is unmaintained and broken since dpdk
2.0... CC'ing the IVSHMEM maintainer in case he has thoughts on this.

- Panu -
Yuanhan Liu
2016-03-08 09:04:45 UTC
Permalink
Post by Yuanhan Liu
To me, maybe you could base it on the SINGLE_FILE_SEGMENTS option, and add
another option, say --no-sort (I confess this name sucks, but you get
my point). With that, we could make sure to create as few huge page
files as possible, to fit your case.
Note that SINGLE_FILE_SEGMENTS is a nasty hack that only the IVSHMEM config
uses; getting rid of it (by replacing it with a runtime switch) would be great.
Can't agree more.

BTW, FYI, Jianfeng and I had a private talk, and we came to agree that
it might be better to handle it outside the normal huge page init stage,
just like this patch does, but adding support for multiple huge page
sizes. Let's not add more messy code there.

--yliu
OTOH IVSHMEM itself seems to have fallen out of fashion since the memnic
driver is unmaintained and broken since dpdk 2.0... CC'ing the IVSHMEM
maintainer in case he has thoughts on this.
Thomas Monjalon
2016-03-08 10:30:33 UTC
Permalink
Post by Yuanhan Liu
Post by Yuanhan Liu
To me, maybe you could base it on the SINGLE_FILE_SEGMENTS option, and add
another option, say --no-sort (I confess this name sucks, but you get
my point). With that, we could make sure to create as few huge page
files as possible, to fit your case.
Note that SINGLE_FILE_SEGMENTS is a nasty hack that only the IVSHMEM config
uses; getting rid of it (by replacing it with a runtime switch) would be great.
Can't agree more.
+1
Post by Yuanhan Liu
BTW, FYI, Jianfeng and I had a private talk, and we came to agree that
it might be better to handle it outside the normal huge page init stage,
just like this patch does, but adding support for multiple huge page
sizes. Let's not add more messy code there.
--yliu
OTOH IVSHMEM itself seems to have fallen out of fashion since the memnic
driver is unmaintained and broken since dpdk 2.0... CC'ing the IVSHMEM
maintainer in case he has thoughts on this.
The ivshmem config was not used for memnic, which was using ivshmem only for
the data path.
CONFIG_RTE_LIBRTE_IVSHMEM and CONFIG_RTE_EAL_SINGLE_FILE_SEGMENTS are more
about full memory sharing.
I have the feeling they could be dropped.
If there are some users, I'd like to see a justification and a rework to
remove these build options.
Burakov, Anatoly
2016-03-08 10:57:22 UTC
Permalink
Hi Thomas,
Post by Jianfeng Tan
Post by Yuanhan Liu
Post by Panu Matilainen
Post by Yuanhan Liu
To me, maybe you could base it on the SINGLE_FILE_SEGMENTS option, and
add another option, say --no-sort (I confess this name sucks, but
you get my point). With that, we could make sure to create as few
huge page files as possible, to fit your case.
Note that SINGLE_FILE_SEGMENTS is a nasty hack that only the
IVSHMEM
Post by Yuanhan Liu
Post by Panu Matilainen
config uses; getting rid of it (by replacing it with a runtime switch) would be
great.
Post by Yuanhan Liu
Can't agree more.
+1
Post by Yuanhan Liu
BTW, FYI, Jianfeng and I had a private talk, and we came to agree that
it might be better to handle it outside the normal huge page init
stage, just like this patch does, but adding support for multiple
huge page sizes. Let's not add more messy code there.
--yliu
Post by Panu Matilainen
OTOH IVSHMEM itself seems to have fallen out of fashion since
the memnic driver is unmaintained and broken since dpdk 2.0...
CC'ing the IVSHMEM maintainer in case he has thoughts on this.
The ivshmem config was not used for memnic, which was using ivshmem only
for the data path.
CONFIG_RTE_LIBRTE_IVSHMEM and
CONFIG_RTE_EAL_SINGLE_FILE_SEGMENTS are more about full memory
sharing.
I have the feeling they could be dropped.
If there are some users, I'd like to see a justification and a rework to remove
these build options.
Just to add my opinion - if there are no users for either of these, I'd like to see them removed as well. Less maintenance is always better than more maintenance, especially for things that no one uses :)

Thanks,
Anatoly
Traynor, Kevin
2016-03-14 13:53:33 UTC
Permalink
-----Original Message-----
Sent: Tuesday, March 8, 2016 10:31 AM
Subject: Re: [dpdk-dev] [PATCH v2 1/5] mem: add --single-file to create
single mem-backed file
Post by Yuanhan Liu
Post by Panu Matilainen
Post by Yuanhan Liu
To me, maybe you could base it on the SINGLE_FILE_SEGMENTS option, and add
another option, say --no-sort (I confess this name sucks, but you get
my point). With that, we could make sure to create as few huge page
files as possible, to fit your case.
Note that SINGLE_FILE_SEGMENTS is a nasty hack that only the IVSHMEM
config
Post by Yuanhan Liu
Post by Panu Matilainen
uses; getting rid of it (by replacing it with a runtime switch) would be
great.
Post by Yuanhan Liu
Can't agree more.
+1
Post by Yuanhan Liu
BTW, FYI, Jianfeng and I had a private talk, and we came to agree that
it might be better to handle it outside the normal huge page init stage,
just like this patch does, but adding support for multiple huge page
sizes. Let's not add more messy code there.
--yliu
Post by Panu Matilainen
OTOH IVSHMEM itself seems to have fallen out of fashion since the
memnic
Post by Yuanhan Liu
Post by Panu Matilainen
driver is unmaintained and broken since dpdk 2.0... CC'ing the IVSHMEM
maintainer in case he has thoughts on this.
The ivshmem config was not used for memnic, which was using ivshmem only for
the data path.
CONFIG_RTE_LIBRTE_IVSHMEM and CONFIG_RTE_EAL_SINGLE_FILE_SEGMENTS are more
about full memory sharing.
I have the feeling they could be dropped.
If there are some users, I'd like to see a justification and a rework to
remove these build options.
Just to clarify - is this suggesting the removal of the IVSHMEM library itself,
or just some of the config options?

The reason I ask is that although we don't currently use it in OVS with DPDK,
I've seen at least one person using it in conjunction with the ring interface.
There may be others, so I want to cross-post if there's a deprecation discussion.

Kevin.
Thomas Monjalon
2016-03-14 14:45:25 UTC
Permalink
From: Thomas Monjalon
Post by Jianfeng Tan
Post by Yuanhan Liu
Post by Panu Matilainen
Note that SINGLE_FILE_SEGMENTS is a nasty hack that only the IVSHMEM
config
Post by Yuanhan Liu
Post by Panu Matilainen
uses; getting rid of it (by replacing it with a runtime switch) would be
great.
Post by Yuanhan Liu
Can't agree more.
+1
Post by Yuanhan Liu
Post by Panu Matilainen
OTOH IVSHMEM itself seems to have fallen out of fashion since the
memnic
Post by Yuanhan Liu
Post by Panu Matilainen
driver is unmaintained and broken since dpdk 2.0... CC'ing the IVSHMEM
maintainer in case he has thoughts on this.
The ivshmem config was not used for memnic, which was using ivshmem only for
the data path.
CONFIG_RTE_LIBRTE_IVSHMEM and CONFIG_RTE_EAL_SINGLE_FILE_SEGMENTS are more
about full memory sharing.
I have the feeling they could be dropped.
If there are some users, I'd like to see a justification and a rework to
remove these build options.
Just to clarify - is this suggesting the removal of the IVSHMEM library itself,
or just some of the config options?
I have no strong opinion about the library.
About the config options, yes they should be removed. Note that they are not
documented, so we don't really know the motivation to have them.
The reason I ask is that although we don't currently use it in OVS with DPDK,
I've seen at least one person using it in conjunction with the ring interface.
There may be others, so I want to cross-post if there's a deprecation discussion.
Thank you for sharing.
Traynor, Kevin
2016-03-14 18:21:20 UTC
Permalink
-----Original Message-----
Sent: Monday, March 14, 2016 2:45 PM
Subject: Re: [dpdk-dev] [PATCH v2 1/5] mem: add --single-file to create
single mem-backed file
From: Thomas Monjalon
Post by Jianfeng Tan
Post by Yuanhan Liu
Post by Panu Matilainen
Note that SINGLE_FILE_SEGMENTS is a nasty hack that only the IVSHMEM
config
Post by Yuanhan Liu
Post by Panu Matilainen
uses; getting rid of it (by replacing it with a runtime switch) would be
great.
Post by Yuanhan Liu
Can't agree more.
+1
Post by Yuanhan Liu
Post by Panu Matilainen
OTOH IVSHMEM itself seems to have fallen out of fashion since the
memnic
Post by Yuanhan Liu
Post by Panu Matilainen
driver is unmaintained and broken since dpdk 2.0... CC'ing the
IVSHMEM
Post by Jianfeng Tan
Post by Yuanhan Liu
Post by Panu Matilainen
maintainer in case he has thoughts on this.
The ivshmem config was not used for memnic, which was using ivshmem only
for
Post by Jianfeng Tan
data path.
CONFIG_RTE_LIBRTE_IVSHMEM and CONFIG_RTE_EAL_SINGLE_FILE_SEGMENTS are
more
Post by Jianfeng Tan
about full memory sharing.
I have the feeling they could be dropped.
If there are some users, I'd like to see a justification and a rework to
remove these build options.
Just to clarify - is this suggesting the removal of the IVSHMEM library
itself,
or just some of the config options?
I have no strong opinion about the library.
About the config options, yes they should be removed. Note that they are not
documented, so we don't really know the motivation to have them.
OK, thanks for clarifying. As there are no imminent plans to remove the library,
I won't cross-post.
The reason I ask is that although we don't currently use it in OVS with
DPDK,
I've seen at least one person using it in conjunction with the ring
interface.
There may be others, so I want to cross-post if there's a deprecation
discussion.
Thank you for sharing.
Jianfeng Tan
2016-02-05 11:20:25 UTC
Permalink
A new API named rte_eal_get_backfile_info() and a new data
struct, back_file, are added to obtain information about
memory-backed files.

Signed-off-by: Huawei Xie <***@intel.com>
Signed-off-by: Jianfeng Tan <***@intel.com>
---
lib/librte_eal/common/include/rte_memory.h | 16 ++++++++++++
lib/librte_eal/linuxapp/eal/eal_memory.c | 40 +++++++++++++++++++++++++++++-
2 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/lib/librte_eal/common/include/rte_memory.h b/lib/librte_eal/common/include/rte_memory.h
index 587a25d..b09397e 100644
--- a/lib/librte_eal/common/include/rte_memory.h
+++ b/lib/librte_eal/common/include/rte_memory.h
@@ -109,6 +109,22 @@ struct rte_memseg {
} __rte_packed;

/**
+ * This struct is used to store information about the memory-backed files
+ * that we mapped during memory initialization.
+ */
+struct back_file {
+ void *addr; /**< virtual addr */
+ size_t size; /**< the page size */
+ char filepath[PATH_MAX]; /**< path to backing file on filesystem */
+};
+
+/**
+ * Get the hugepage file information; the caller must free the returned
+ * array. Returns the number of hugepage files used.
+ */
+int rte_eal_get_backfile_info(struct back_file **);
+
+/**
* Lock page in physical memory and prevent from swapping.
*
* @param virt
diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index 68ef49a..a6b3616 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -743,6 +743,9 @@ sort_by_physaddr(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi)
return 0;
}

+static struct hugepage_file *hugepage_files;
+static int num_hugepage_files;
+
/*
* Uses mmap to create a shared memory area for storage of data
* Used in this file to store the hugepage file map on disk
@@ -760,9 +763,30 @@ create_shared_memory(const char *filename, const size_t mem_size)
}
retval = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
close(fd);
+
+ hugepage_files = retval;
+ num_hugepage_files = mem_size / (sizeof(struct hugepage_file));
+
return retval;
}

+int
+rte_eal_get_backfile_info(struct back_file **p)
+{
+ struct back_file *backfiles;
+ int i, num_backfiles = num_hugepage_files;
+
+ backfiles = malloc(sizeof(struct back_file) * num_backfiles);
+ for (i = 0; i < num_backfiles; ++i) {
+ backfiles[i].addr = hugepage_files[i].final_va;
+ backfiles[i].size = hugepage_files[i].size;
+ strcpy(backfiles[i].filepath, hugepage_files[i].filepath);
+ }
+
+ *p = backfiles;
+ return num_backfiles;
+}
+
/*
* this copies *active* hugepages from one hugepage table to another.
* destination is typically the shared memory.
@@ -1148,8 +1172,22 @@ rte_eal_hugepage_init(void)
mcfg->memseg[0].len = internal_config.memory;
mcfg->memseg[0].socket_id = socket_id;

- close(fd);
+ hugepage = create_shared_memory(eal_hugepage_info_path(),
+ sizeof(struct hugepage_file));
+ hugepage->orig_va = addr;
+ hugepage->final_va = addr;
+ hugepage->physaddr = rte_mem_virt2phy(addr);
+ /* Suppose we have a very huge hugefile here */
+ hugepage->size = internal_config.memory;
+ hugepage->socket_id = socket_id;
+ hugepage->file_id = 0;
+ hugepage->memseg_id = 0;
+#ifdef RTE_EAL_SINGLE_FILE_SEGMENTS
+ hugepage->repeated = internal_config.memory / pagesize;
+#endif
+ strncpy(hugepage->filepath, filepath, MAX_HUGEPAGE_PATH);

+ close(fd);
return 0;
}
--
2.1.4
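(For context, a short sketch of how a caller would consume the API added
by the patch above; per the comment in rte_memory.h, the caller frees the
returned array. Error handling is omitted:)

#include <stdio.h>
#include <stdlib.h>
#include <rte_memory.h>

static void
show_backfiles(void)
{
	struct back_file *files;
	int i, n;

	n = rte_eal_get_backfile_info(&files);
	for (i = 0; i < n; i++)
		printf("%s: addr %p, size %zu\n",
		       files[i].filepath, files[i].addr, files[i].size);
	free(files);	/* the caller is responsible for freeing */
}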
Yuanhan Liu
2016-03-07 13:22:38 UTC
Permalink
Post by Jianfeng Tan
A new API named rte_eal_get_backfile_info() and a new data
struct, back_file, are added to obtain information about
memory-backed files.
I would normally suggest trying hard to find some other solution, instead
of introducing yet another new API, especially when you have come up with
only one user.
Post by Jianfeng Tan
---
lib/librte_eal/common/include/rte_memory.h | 16 ++++++++++++
lib/librte_eal/linuxapp/eal/eal_memory.c | 40 +++++++++++++++++++++++++++++-
2 files changed, 55 insertions(+), 1 deletion(-)
diff --git a/lib/librte_eal/common/include/rte_memory.h b/lib/librte_eal/common/include/rte_memory.h
index 587a25d..b09397e 100644
--- a/lib/librte_eal/common/include/rte_memory.h
+++ b/lib/librte_eal/common/include/rte_memory.h
@@ -109,6 +109,22 @@ struct rte_memseg {
} __rte_packed;
/**
+ * This struct is used to store information about the memory-backed files
+ * that we mapped during memory initialization.
+ */
+struct back_file {
+ void *addr; /**< virtual addr */
+ size_t size; /**< the page size */
+ char filepath[PATH_MAX]; /**< path to backing file on filesystem */
+};
So, that's all the info you'd like to get. I'm thinking you may not
need another new API to retrieve them at all:

Say, you can get the filepath and fd from /proc/self/fd (by filtering it
with "rtemap_"):

$ ls /proc/3487/fd -l
total 0
lrwx------ 1 root root 64 Mar 7 20:37 0 -> /dev/pts/2
lrwx------ 1 root root 64 Mar 7 20:37 1 -> /dev/pts/2
lrwx------ 1 root root 64 Mar 7 20:37 2 -> /dev/pts/2
lrwx------ 1 root root 64 Mar 7 20:37 3 -> /run/.rte_config
lr-x------ 1 root root 64 Mar 7 20:37 4 -> /dev/hugepages
lr-x------ 1 root root 64 Mar 7 20:37 5 -> /mnt
==> lrwx------ 1 root root 64 Mar 7 20:37 6 -> /dev/hugepages/rtemap_0


That could also save you an extra "open" at the caller side for that
file as well.

And you can get the virtual addr and size from /proc/self/maps:

$ grep rtemap_ /proc/3487/maps
7fff40000000-7fffc0000000 rw-s 00000000 00:22 21082 /dev/hugepages/rtemap_0


Will that work for you?
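(A rough sketch of the string processing this implies; the parsing and
buffer sizes are illustrative only:)

#include <stdio.h>
#include <string.h>

/* Scan /proc/self/maps for "rtemap_" hugepage mappings and print the
 * virtual address range and backing file path of each one. */
static void
dump_rtemap_mappings(void)
{
	unsigned long start, end;
	char line[1024], path[512];
	FILE *f = fopen("/proc/self/maps", "r");

	if (f == NULL)
		return;
	while (fgets(line, sizeof(line), f) != NULL) {
		if (strstr(line, "rtemap_") == NULL)
			continue;
		if (sscanf(line, "%lx-%lx %*s %*s %*s %*s %511s",
			   &start, &end, path) == 3)
			printf("%s: %lx-%lx (%lu bytes)\n",
			       path, start, end, end - start);
	}
	fclose(f);
}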

--yliu
Tan, Jianfeng
2016-03-08 02:31:10 UTC
Permalink
Post by Yuanhan Liu
Post by Jianfeng Tan
A new API named rte_eal_get_backfile_info() and a new data
struct, back_file, are added to obtain information about
memory-backed files.
I would normally suggest trying hard to find some other solution, instead
of introducing yet another new API, especially when you have come up with
only one user.
Actually, Tetsuya's qtest patchset will make it two.
Post by Yuanhan Liu
Post by Jianfeng Tan
---
lib/librte_eal/common/include/rte_memory.h | 16 ++++++++++++
lib/librte_eal/linuxapp/eal/eal_memory.c | 40 +++++++++++++++++++++++++++++-
2 files changed, 55 insertions(+), 1 deletion(-)
diff --git a/lib/librte_eal/common/include/rte_memory.h b/lib/librte_eal/common/include/rte_memory.h
index 587a25d..b09397e 100644
--- a/lib/librte_eal/common/include/rte_memory.h
+++ b/lib/librte_eal/common/include/rte_memory.h
@@ -109,6 +109,22 @@ struct rte_memseg {
} __rte_packed;
/**
+ * This struct is used to store information about the memory-backed files
+ * that we mapped during memory initialization.
+ */
+struct back_file {
+ void *addr; /**< virtual addr */
+ size_t size; /**< the page size */
+ char filepath[PATH_MAX]; /**< path to backing file on filesystem */
+};
So, that's all the info you'd like to get. I'm thinking you may not
Say, you can get the filepath and fd from /proc/self/fd (by filtering it
$ ls /proc/3487/fd -l
total 0
lrwx------ 1 root root 64 Mar 7 20:37 0 -> /dev/pts/2
lrwx------ 1 root root 64 Mar 7 20:37 1 -> /dev/pts/2
lrwx------ 1 root root 64 Mar 7 20:37 2 -> /dev/pts/2
lrwx------ 1 root root 64 Mar 7 20:37 3 -> /run/.rte_config
lr-x------ 1 root root 64 Mar 7 20:37 4 -> /dev/hugepages
lr-x------ 1 root root 64 Mar 7 20:37 5 -> /mnt
==> lrwx------ 1 root root 64 Mar 7 20:37 6 -> /dev/hugepages/rtemap_0
I guess these rtemap_xxx files have been closed after memory initialization
and cannot be obtained from /proc/xxx/fd. I believe /proc/xxx/maps is what
you meant to say.
Post by Yuanhan Liu
That could also save you an extra "open" at the caller side for that
file as well.
For the same reason, we cannot save the extra "open".
Post by Yuanhan Liu
$ grep rtemap_ /proc/3487/maps
7fff40000000-7fffc0000000 rw-s 00000000 00:22 21082 /dev/hugepages/rtemap_0
Will that work for you?
Yes, functionally speaking, it works for me. But it needs some string
processing. Another way is to just expose a global variable pointing
to the address of /run/.rte_config, so that callers extract the needed
information by themselves using "struct hugepage_file". What do you think?

Thanks,
Jianfeng
Post by Yuanhan Liu
--yliu
Yuanhan Liu
2016-03-08 02:53:46 UTC
Permalink
Post by Tan, Jianfeng
Post by Yuanhan Liu
Post by Jianfeng Tan
A new API named rte_eal_get_backfile_info() and a new data
struct, back_file, are added to obtain information about
memory-backed files.
I would normally suggest trying hard to find some other solution, instead
of introducing yet another new API, especially when you have come up with
only one user.
Actually, Tetsuya's qtest patchset will make it two.
Well, it's actually the same story. So, still one user to me.
Post by Tan, Jianfeng
Post by Yuanhan Liu
Post by Jianfeng Tan
---
lib/librte_eal/common/include/rte_memory.h | 16 ++++++++++++
lib/librte_eal/linuxapp/eal/eal_memory.c | 40 +++++++++++++++++++++++++++++-
2 files changed, 55 insertions(+), 1 deletion(-)
diff --git a/lib/librte_eal/common/include/rte_memory.h b/lib/librte_eal/common/include/rte_memory.h
index 587a25d..b09397e 100644
--- a/lib/librte_eal/common/include/rte_memory.h
+++ b/lib/librte_eal/common/include/rte_memory.h
@@ -109,6 +109,22 @@ struct rte_memseg {
} __rte_packed;
/**
+ * This struct is used to store information about the memory-backed files
+ * that we mapped during memory initialization.
+ */
+struct back_file {
+ void *addr; /**< virtual addr */
+ size_t size; /**< the page size */
+ char filepath[PATH_MAX]; /**< path to backing file on filesystem */
+};
So, that's all the info you'd like to get. I'm thinking you may not
Say, you can get the filepath and fd from /proc/self/fd (by filtering it
$ ls /proc/3487/fd -l
total 0
lrwx------ 1 root root 64 Mar 7 20:37 0 -> /dev/pts/2
lrwx------ 1 root root 64 Mar 7 20:37 1 -> /dev/pts/2
lrwx------ 1 root root 64 Mar 7 20:37 2 -> /dev/pts/2
lrwx------ 1 root root 64 Mar 7 20:37 3 -> /run/.rte_config
lr-x------ 1 root root 64 Mar 7 20:37 4 -> /dev/hugepages
lr-x------ 1 root root 64 Mar 7 20:37 5 -> /mnt
==> lrwx------ 1 root root 64 Mar 7 20:37 6 -> /dev/hugepages/rtemap_0
I guess these rtemap_xxx files have been closed after memory initialization
and cannot be obtained from /proc/xxx/fd. I believe /proc/xxx/maps is what
you meant to say.
Yes, I forgot to mention that you need to keep that file open.
So, you just need a line or two to not close that file
in this case.
Post by Tan, Jianfeng
Post by Yuanhan Liu
That could also save you an extra "open" at the caller side for that
file as well.
For the same reason, we cannot save the extra "open".
We could, if we keep the file open.
Post by Tan, Jianfeng
Post by Yuanhan Liu
$ grep rtemap_ /proc/3487/maps
7fff40000000-7fffc0000000 rw-s 00000000 00:22 21082 /dev/hugepages/rtemap_0
Will that work for you?
Yes, functionally speaking, it works for me. But it needs some string
processing.
What's wrong with string processing? I have seen plenty of string
processing in DPDK code, even in rte_memory.c.
Post by Tan, Jianfeng
Another way is to just expose a global variable pointing to
the address of /run/.rte_config, so that callers extract the needed information
by themselves using "struct hugepage_file". What do you think?
That doesn't seem elegant to me.

--yliu
Jianfeng Tan
2016-02-05 11:20:26 UTC
Permalink
To implement a virtio vdev, we need a way to interact with the vhost
backend, and, more importantly, a way to emulate a device inside DPDK. So
this patch acts as an embedded device emulation layer.

Which backend is used depends on the type of the vhost file: vhost-user
is used if the given path points to a unix socket; vhost-net is used if
the given path points to a char device.
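(A minimal sketch of how such a path-type check can be done with plain
stat(); the helper name is hypothetical, not from this patch:)

#include <sys/stat.h>

/* Return 1 for a unix socket (use vhost-user), 0 for a char device
 * (use the vhost-net kernel backend), -1 on error or unsupported type. */
static int
vhost_path_is_socket(const char *path)
{
	struct stat st;

	if (stat(path, &st) < 0)
		return -1;
	if (S_ISSOCK(st.st_mode))
		return 1;
	if (S_ISCHR(st.st_mode))
		return 0;
	return -1;
}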

Signed-off-by: Huawei Xie <***@intel.com>
Signed-off-by: Jianfeng Tan <***@intel.com>
---
config/common_linuxapp | 5 +
drivers/net/virtio/Makefile | 4 +
drivers/net/virtio/vhost.h | 194 +++++++++
drivers/net/virtio/vhost_embedded.c | 809 ++++++++++++++++++++++++++++++++++++
drivers/net/virtio/virtio_ethdev.h | 6 +-
drivers/net/virtio/virtio_pci.h | 15 +-
6 files changed, 1031 insertions(+), 2 deletions(-)
create mode 100644 drivers/net/virtio/vhost.h
create mode 100644 drivers/net/virtio/vhost_embedded.c

diff --git a/config/common_linuxapp b/config/common_linuxapp
index 74bc515..f76e162 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -534,3 +534,8 @@ CONFIG_RTE_APP_TEST=y
CONFIG_RTE_TEST_PMD=y
CONFIG_RTE_TEST_PMD_RECORD_CORE_CYCLES=n
CONFIG_RTE_TEST_PMD_RECORD_BURST_STATS=n
+
+#
+# Enable virtio support for container
+#
+CONFIG_RTE_VIRTIO_VDEV=y
diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index 43835ba..ef920f9 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -52,6 +52,10 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c

+ifeq ($(CONFIG_RTE_VIRTIO_VDEV),y)
+ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += vhost_embedded.c
+endif
+
# this lib depends upon:
DEPDIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += lib/librte_eal lib/librte_ether
DEPDIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += lib/librte_mempool lib/librte_mbuf
diff --git a/drivers/net/virtio/vhost.h b/drivers/net/virtio/vhost.h
new file mode 100644
index 0000000..73d4f5c
--- /dev/null
+++ b/drivers/net/virtio/vhost.h
@@ -0,0 +1,194 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _VHOST_NET_USER_H
+#define _VHOST_NET_USER_H
+
+#include <stdint.h>
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define VHOST_MEMORY_MAX_NREGIONS 8
+
+struct vhost_vring_state {
+ unsigned int index;
+ unsigned int num;
+};
+
+struct vhost_vring_file {
+ unsigned int index;
+ int fd;
+};
+
+struct vhost_vring_addr {
+ unsigned int index;
+ /* Option flags. */
+ unsigned int flags;
+ /* Flag values: */
+ /* Whether log address is valid. If set enables logging. */
+#define VHOST_VRING_F_LOG 0
+
+ /* Start of array of descriptors (virtually contiguous) */
+ uint64_t desc_user_addr;
+ /* Used structure address. Must be 32 bit aligned */
+ uint64_t used_user_addr;
+ /* Available structure address. Must be 16 bit aligned */
+ uint64_t avail_user_addr;
+ /* Logging support. */
+ /* Log writes to used structure, at offset calculated from specified
+ * address. Address must be 32 bit aligned.
+ */
+ uint64_t log_guest_addr;
+};
+
+#define VIRTIO_CONFIG_S_DRIVER_OK 4
+
+enum vhost_user_request {
+ VHOST_USER_NONE = 0,
+ VHOST_USER_GET_FEATURES = 1,
+ VHOST_USER_SET_FEATURES = 2,
+ VHOST_USER_SET_OWNER = 3,
+ VHOST_USER_RESET_OWNER = 4,
+ VHOST_USER_SET_MEM_TABLE = 5,
+ VHOST_USER_SET_LOG_BASE = 6,
+ VHOST_USER_SET_LOG_FD = 7,
+ VHOST_USER_SET_VRING_NUM = 8,
+ VHOST_USER_SET_VRING_ADDR = 9,
+ VHOST_USER_SET_VRING_BASE = 10,
+ VHOST_USER_GET_VRING_BASE = 11,
+ VHOST_USER_SET_VRING_KICK = 12,
+ VHOST_USER_SET_VRING_CALL = 13,
+ VHOST_USER_SET_VRING_ERR = 14,
+ VHOST_USER_GET_PROTOCOL_FEATURES = 15,
+ VHOST_USER_SET_PROTOCOL_FEATURES = 16,
+ VHOST_USER_GET_QUEUE_NUM = 17,
+ VHOST_USER_SET_VRING_ENABLE = 18,
+ VHOST_USER_MAX
+};
+
+struct vhost_memory_region {
+ uint64_t guest_phys_addr;
+ uint64_t memory_size; /* bytes */
+ uint64_t userspace_addr;
+ uint64_t mmap_offset;
+};
+
+struct vhost_memory_kernel {
+ uint32_t nregions;
+ uint32_t padding;
+ struct vhost_memory_region regions[0];
+};
+
+struct vhost_memory {
+ uint32_t nregions;
+ uint32_t padding;
+ struct vhost_memory_region regions[VHOST_MEMORY_MAX_NREGIONS];
+};
+
+struct vhost_user_msg {
+ enum vhost_user_request request;
+
+#define VHOST_USER_VERSION_MASK 0x3
+#define VHOST_USER_REPLY_MASK (0x1 << 2)
+ uint32_t flags;
+ uint32_t size; /* the following payload size */
+ union {
+#define VHOST_USER_VRING_IDX_MASK 0xff
+#define VHOST_USER_VRING_NOFD_MASK (0x1 << 8)
+ uint64_t u64;
+ struct vhost_vring_state state;
+ struct vhost_vring_addr addr;
+ struct vhost_memory memory;
+ } payload;
+ int fds[VHOST_MEMORY_MAX_NREGIONS];
+} __attribute((packed));
+
+#define VHOST_USER_HDR_SIZE offsetof(struct vhost_user_msg, payload.u64)
+#define VHOST_USER_PAYLOAD_SIZE (sizeof(struct vhost_user_msg) - VHOST_USER_HDR_SIZE)
+
+/* The version of the protocol we support */
+#define VHOST_USER_VERSION 0x1
+
+/* ioctls */
+
+#define VHOST_VIRTIO 0xAF
+
+#define VHOST_GET_FEATURES _IOR(VHOST_VIRTIO, 0x00, __u64)
+#define VHOST_SET_FEATURES _IOW(VHOST_VIRTIO, 0x00, __u64)
+#define VHOST_SET_OWNER _IO(VHOST_VIRTIO, 0x01)
+#define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02)
+#define VHOST_SET_MEM_TABLE _IOW(VHOST_VIRTIO, 0x03, struct vhost_memory_kernel)
+#define VHOST_SET_LOG_BASE _IOW(VHOST_VIRTIO, 0x04, __u64)
+#define VHOST_SET_LOG_FD _IOW(VHOST_VIRTIO, 0x07, int)
+#define VHOST_SET_VRING_NUM _IOW(VHOST_VIRTIO, 0x10, struct vhost_vring_state)
+#define VHOST_SET_VRING_ADDR _IOW(VHOST_VIRTIO, 0x11, struct vhost_vring_addr)
+#define VHOST_SET_VRING_BASE _IOW(VHOST_VIRTIO, 0x12, struct vhost_vring_state)
+#define VHOST_GET_VRING_BASE _IOWR(VHOST_VIRTIO, 0x12, struct vhost_vring_state)
+#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_vring_file)
+#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_vring_file)
+#define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
+#define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file)
+
+/*****************************************************************************/
+
+/* Ioctl defines */
+#define TUNSETIFF _IOW('T', 202, int)
+#define TUNGETFEATURES _IOR('T', 207, unsigned int)
+#define TUNSETOFFLOAD _IOW('T', 208, unsigned int)
+#define TUNGETIFF _IOR('T', 210, unsigned int)
+#define TUNSETSNDBUF _IOW('T', 212, int)
+#define TUNGETVNETHDRSZ _IOR('T', 215, int)
+#define TUNSETVNETHDRSZ _IOW('T', 216, int)
+#define TUNSETQUEUE _IOW('T', 217, int)
+#define TUNSETVNETLE _IOW('T', 220, int)
+#define TUNSETVNETBE _IOW('T', 222, int)
+
+/* TUNSETIFF ifr flags */
+#define IFF_TAP 0x0002
+#define IFF_NO_PI 0x1000
+#define IFF_ONE_QUEUE 0x2000
+#define IFF_VNET_HDR 0x4000
+#define IFF_MULTI_QUEUE 0x0100
+#define IFF_ATTACH_QUEUE 0x0200
+#define IFF_DETACH_QUEUE 0x0400
+
+/* Features for GSO (TUNSETOFFLOAD). */
+#define TUN_F_CSUM 0x01 /* You can hand me unchecksummed packets. */
+#define TUN_F_TSO4 0x02 /* I can handle TSO for IPv4 packets */
+#define TUN_F_TSO6 0x04 /* I can handle TSO for IPv6 packets */
+#define TUN_F_TSO_ECN 0x08 /* I can handle TSO with ECN bits. */
+#define TUN_F_UFO 0x10 /* I can handle UFO packets */
+
+#define PATH_NET_TUN "/dev/net/tun"
+
+#endif
diff --git a/drivers/net/virtio/vhost_embedded.c b/drivers/net/virtio/vhost_embedded.c
new file mode 100644
index 0000000..0073b86
--- /dev/null
+++ b/drivers/net/virtio/vhost_embedded.c
@@ -0,0 +1,809 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdint.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <stdio.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <string.h>
+#include <errno.h>
+#include <assert.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <sys/eventfd.h>
+#include <sys/ioctl.h>
+#include <net/if.h>
+
+#include <rte_mbuf.h>
+#include <rte_memory.h>
+#include <rte_eal_memconfig.h>
+
+#include "virtio_pci.h"
+#include "virtio_logs.h"
+#include "virtio_ethdev.h"
+#include "virtqueue.h"
+#include "vhost.h"
+
+static int
+vhost_user_write(int fd, void *buf, int len, int *fds, int fd_num)
+{
+ int r;
+ struct msghdr msgh;
+ struct iovec iov;
+ size_t fd_size = fd_num * sizeof(int);
+ char control[CMSG_SPACE(fd_size)];
+ struct cmsghdr *cmsg;
+
+ bzero(&msgh, sizeof(msgh));
+ bzero(control, sizeof(control));
+
+ iov.iov_base = (uint8_t *)buf;
+ iov.iov_len = len;
+
+ msgh.msg_iov = &iov;
+ msgh.msg_iovlen = 1;
+ msgh.msg_control = control;
+ msgh.msg_controllen = sizeof(control);
+
+ cmsg = CMSG_FIRSTHDR(&msgh);
+ cmsg->cmsg_len = CMSG_LEN(fd_size);
+ cmsg->cmsg_level = SOL_SOCKET;
+ cmsg->cmsg_type = SCM_RIGHTS;
+ memcpy(CMSG_DATA(cmsg), fds, fd_size);
+
+ do {
+ r = sendmsg(fd, &msgh, 0);
+ } while (r < 0 && errno == EINTR);
+
+ return r;
+}
+
+static int
+vhost_user_read(int fd, struct vhost_user_msg *msg)
+{
+ uint32_t valid_flags = VHOST_USER_REPLY_MASK | VHOST_USER_VERSION;
+ int ret, sz_hdr = VHOST_USER_HDR_SIZE, sz_payload;
+
+ ret = recv(fd, (void *)msg, sz_hdr, 0);
+ if (ret < sz_hdr) {
+ PMD_DRV_LOG(ERR, "Failed to recv msg hdr: %d instead of %d.",
+ ret, sz_hdr);
+ goto fail;
+ }
+
+ /* validate msg flags */
+ if (msg->flags != (valid_flags)) {
+ PMD_DRV_LOG(ERR, "Failed to recv msg: flags %x instead of %x.",
+ msg->flags, valid_flags);
+ goto fail;
+ }
+
+ sz_payload = msg->size;
+ if (sz_payload) {
+ ret = recv(fd, (void *)((char *)msg + sz_hdr), sz_payload, 0);
+ if (ret < sz_payload) {
+ PMD_DRV_LOG(ERR, "Failed to recv msg payload: %d instead of %d.",
+ ret, msg->size);
+ goto fail;
+ }
+ }
+
+ return 0;
+
+fail:
+ return -1;
+}
+
+static struct vhost_user_msg m __rte_unused;
+
+static void
+prepare_vhost_memory_user(struct vhost_user_msg *msg, int fds[])
+{
+ int i, num;
+ struct back_file *huges;
+ struct vhost_memory_region *mr;
+
+ num = rte_eal_get_backfile_info(&huges);
+
+ if (num > VHOST_MEMORY_MAX_NREGIONS)
+ rte_panic("%d files exceed maximum of %d for vhost-user\n",
+ num, VHOST_MEMORY_MAX_NREGIONS);
+
+ for (i = 0; i < num; ++i) {
+ mr = &msg->payload.memory.regions[i];
+ mr->guest_phys_addr = (uint64_t)huges[i].addr; /* use vaddr! */
+ mr->userspace_addr = (uint64_t)huges[i].addr;
+ mr->memory_size = huges[i].size;
+ mr->mmap_offset = 0;
+ fds[i] = open(huges[i].filepath, O_RDWR);
+ }
+
+ msg->payload.memory.nregions = num;
+ msg->payload.memory.padding = 0;
+ free(huges);
+}
+
+static int
+vhost_user_sock(struct virtio_hw *hw, unsigned long int req, void *arg)
+{
+ struct vhost_user_msg msg;
+ struct vhost_vring_file *file = 0;
+ int need_reply = 0;
+ int fds[VHOST_MEMORY_MAX_NREGIONS];
+ int fd_num = 0;
+ int i, len;
+
+ msg.request = req;
+ msg.flags = VHOST_USER_VERSION;
+ msg.size = 0;
+
+ switch (req) {
+ case VHOST_USER_GET_FEATURES:
+ need_reply = 1;
+ break;
+
+ case VHOST_USER_SET_FEATURES:
+ case VHOST_USER_SET_LOG_BASE:
+ msg.payload.u64 = *((__u64 *)arg);
+ msg.size = sizeof(m.payload.u64);
+ break;
+
+ case VHOST_USER_SET_OWNER:
+ case VHOST_USER_RESET_OWNER:
+ break;
+
+ case VHOST_USER_SET_MEM_TABLE:
+ prepare_vhost_memory_user(&msg, fds);
+ fd_num = msg.payload.memory.nregions;
+ msg.size = sizeof(m.payload.memory.nregions);
+ msg.size += sizeof(m.payload.memory.padding);
+ msg.size += fd_num * sizeof(struct vhost_memory_region);
+ break;
+
+ case VHOST_USER_SET_LOG_FD:
+ fds[fd_num++] = *((int *)arg);
+ break;
+
+ case VHOST_USER_SET_VRING_NUM:
+ case VHOST_USER_SET_VRING_BASE:
+ memcpy(&msg.payload.state, arg, sizeof(msg.payload.state));
+ msg.size = sizeof(m.payload.state);
+ break;
+
+ case VHOST_USER_GET_VRING_BASE:
+ memcpy(&msg.payload.state, arg, sizeof(msg.payload.state));
+ msg.size = sizeof(m.payload.state);
+ need_reply = 1;
+ break;
+
+ case VHOST_USER_SET_VRING_ADDR:
+ memcpy(&msg.payload.addr, arg, sizeof(msg.payload.addr));
+ msg.size = sizeof(m.payload.addr);
+ break;
+
+ case VHOST_USER_SET_VRING_KICK:
+ case VHOST_USER_SET_VRING_CALL:
+ case VHOST_USER_SET_VRING_ERR:
+ file = arg;
+ msg.payload.u64 = file->index & VHOST_USER_VRING_IDX_MASK;
+ msg.size = sizeof(m.payload.u64);
+ if (file->fd > 0)
+ fds[fd_num++] = file->fd;
+ else
+ msg.payload.u64 |= VHOST_USER_VRING_NOFD_MASK;
+ break;
+
+ default:
+ PMD_DRV_LOG(ERR, "vhost-user trying to send unhandled msg type");
+ return -1;
+ }
+
+ len = VHOST_USER_HDR_SIZE + msg.size;
+ if (vhost_user_write(hw->vhostfd, &msg, len, fds, fd_num) < 0)
+ return 0;
+
+ if (req == VHOST_USER_SET_MEM_TABLE)
+ for (i = 0; i < fd_num; ++i)
+ close(fds[i]);
+
+ if (need_reply) {
+ if (vhost_user_read(hw->vhostfd, &msg) < 0)
+ return -1;
+
+ if (req != msg.request) {
+ PMD_DRV_LOG(ERR, "Received unexpected msg type.");
+ return -1;
+ }
+
+ switch (req) {
+ case VHOST_USER_GET_FEATURES:
+ if (msg.size != sizeof(m.payload.u64)) {
+ PMD_DRV_LOG(ERR, "Received bad msg size.");
+ return -1;
+ }
+ *((__u64 *)arg) = msg.payload.u64;
+ break;
+ case VHOST_USER_GET_VRING_BASE:
+ if (msg.size != sizeof(m.payload.state)) {
+ PMD_DRV_LOG(ERR, "Received bad msg size.");
+ return -1;
+ }
+ memcpy(arg, &msg.payload.state,
+ sizeof(struct vhost_vring_state));
+ break;
+ default:
+ PMD_DRV_LOG(ERR, "Received unexpected msg type.");
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
+static int
+vhost_kernel_ioctl(struct virtio_hw *hw, unsigned long int req, void *arg)
+{
+ return ioctl(hw->vhostfd, req, arg);
+}
+
+enum {
+ VHOST_MSG_SET_OWNER,
+ VHOST_MSG_SET_FEATURES,
+ VHOST_MSG_GET_FEATURES,
+ VHOST_MSG_SET_VRING_CALL,
+ VHOST_MSG_SET_VRING_NUM,
+ VHOST_MSG_SET_VRING_BASE,
+ VHOST_MSG_GET_VRING_BASE,
+ VHOST_MSG_SET_VRING_ADDR,
+ VHOST_MSG_SET_VRING_KICK,
+ VHOST_MSG_SET_MEM_TABLE,
+ VHOST_MSG_MAX,
+};
+
+static const char * const vhost_msg_strings[] = {
+ [VHOST_MSG_SET_OWNER] = "VHOST_MSG_SET_OWNER",
+ [VHOST_MSG_SET_FEATURES] = "VHOST_MSG_SET_FEATURES",
+ [VHOST_MSG_GET_FEATURES] = "VHOST_MSG_GET_FEATURES",
+ [VHOST_MSG_SET_VRING_CALL] = "VHOST_MSG_SET_VRING_CALL",
+ [VHOST_MSG_SET_VRING_NUM] = "VHOST_MSG_SET_VRING_NUM",
+ [VHOST_MSG_SET_VRING_BASE] = "VHOST_MSG_SET_VRING_BASE",
+ [VHOST_MSG_GET_VRING_BASE] = "VHOST_MSG_GET_VRING_BASE",
+ [VHOST_MSG_SET_VRING_ADDR] = "VHOST_MSG_SET_VRING_ADDR",
+ [VHOST_MSG_SET_VRING_KICK] = "VHOST_MSG_SET_VRING_KICK",
+ [VHOST_MSG_SET_MEM_TABLE] = "VHOST_MSG_SET_MEM_TABLE",
+ NULL,
+};
+
+static unsigned long int vhost_req_map[][2] = {
+ [VHOST_MSG_SET_OWNER] = {
+ VHOST_SET_OWNER, VHOST_USER_SET_OWNER
+ },
+ [VHOST_MSG_SET_FEATURES] = {
+ VHOST_SET_FEATURES, VHOST_USER_SET_FEATURES
+ },
+ [VHOST_MSG_GET_FEATURES] = {
+ VHOST_GET_FEATURES, VHOST_USER_GET_FEATURES
+ },
+ [VHOST_MSG_SET_VRING_CALL] = {
+ VHOST_SET_VRING_CALL, VHOST_USER_SET_VRING_CALL
+ },
+ [VHOST_MSG_SET_VRING_NUM] = {
+ VHOST_SET_VRING_NUM, VHOST_USER_SET_VRING_NUM
+ },
+ [VHOST_MSG_SET_VRING_BASE] = {
+ VHOST_SET_VRING_BASE, VHOST_USER_SET_VRING_BASE
+ },
+ [VHOST_MSG_GET_VRING_BASE] = {
+ VHOST_GET_VRING_BASE, VHOST_USER_GET_VRING_BASE
+ },
+ [VHOST_MSG_SET_VRING_ADDR] = {
+ VHOST_SET_VRING_ADDR, VHOST_USER_SET_VRING_ADDR
+ },
+ [VHOST_MSG_SET_VRING_KICK] = {
+ VHOST_SET_VRING_KICK, VHOST_USER_SET_VRING_KICK
+ },
+ [VHOST_MSG_SET_MEM_TABLE] = {
+ VHOST_SET_MEM_TABLE, VHOST_USER_SET_MEM_TABLE
+ },
+};
+
+static int
+vhost_call(struct virtio_hw *hw, unsigned long int req_orig, void *arg)
+{
+ unsigned long int req_new;
+ int ret;
+
+ if (req_orig >= VHOST_MSG_MAX)
+ rte_panic("invalid req: %lu\n", req_orig);
+
+ PMD_DRV_LOG(INFO, "%s\n", vhost_msg_strings[req_orig]);
+ req_new = vhost_req_map[req_orig][hw->type];
+ if (hw->type == VHOST_USER)
+ ret = vhost_user_sock(hw, req_new, arg);
+ else
+ ret = vhost_kernel_ioctl(hw, req_new, arg);
+
+ if (ret < 0)
+ rte_panic("vhost_call %s failed: %s\n",
+ vhost_msg_strings[req_orig], strerror(errno));
+
+ return ret;
+}
+
+static void
+kick_one_vq(struct virtio_hw *hw, struct virtqueue *vq, unsigned queue_sel)
+{
+ int callfd, kickfd;
+ struct vhost_vring_file file;
+ struct vhost_vring_state state;
+ struct vhost_vring_addr addr = {
+ .index = queue_sel,
+ .desc_user_addr = (uint64_t)(uintptr_t)vq->vq_ring.desc,
+ .avail_user_addr = (uint64_t)(uintptr_t)vq->vq_ring.avail,
+ .used_user_addr = (uint64_t)(uintptr_t)vq->vq_ring.used,
+ .log_guest_addr = 0,
+ .flags = 0, /* disable log */
+ };
+
+ /* We could use an invalid flag to disable it, but vhost-dpdk uses this
+ * to judge if the device is alive, so in the end we need two real
+ * eventfds.
+ */
+ /* Of all per-virtqueue messages, make sure VHOST_SET_VRING_CALL comes
+ * first, because vhost depends on this message to allocate the
+ * virtqueue pair.
+ */
+ callfd = eventfd(0, O_CLOEXEC | O_NONBLOCK);
+ if (callfd < 0)
+ rte_panic("callfd error, %s\n", strerror(errno));
+
+ file.index = queue_sel;
+ file.fd = callfd;
+ vhost_call(hw, VHOST_MSG_SET_VRING_CALL, &file);
+ hw->callfds[queue_sel] = callfd;
+
+ state.index = queue_sel;
+ state.num = vq->vq_ring.num;
+ vhost_call(hw, VHOST_MSG_SET_VRING_NUM, &state);
+
+ state.num = 0; /* no reservation */
+ vhost_call(hw, VHOST_MSG_SET_VRING_BASE, &state);
+
+ vhost_call(hw, VHOST_MSG_SET_VRING_ADDR, &addr);
+
+ /* Of all per-virtqueue messages, make sure VHOST_SET_VRING_KICK comes
+ * last, because vhost depends on this message to judge whether
+ * virtio_is_ready().
+ */
+ kickfd = eventfd(0, O_CLOEXEC | O_NONBLOCK);
+ if (kickfd < 0)
+ rte_panic("kickfd error, %s\n", strerror(errno));
+
+ file.fd = kickfd;
+ vhost_call(hw, VHOST_MSG_SET_VRING_KICK, &file);
+ hw->kickfds[queue_sel] = kickfd;
+}
+
+/**
+ * Merge those virtually adjacent memsegs into one region.
+ */
+static void
+prepare_vhost_memory_kernel(struct vhost_memory_kernel **p_vm)
+{
+ unsigned i, j, k = 0;
+ struct rte_memseg *seg;
+ struct vhost_memory_region *mr;
+ struct vhost_memory_kernel *vm;
+
+ vm = malloc(sizeof(struct vhost_memory_kernel) +
+ RTE_MAX_MEMSEG * sizeof(struct vhost_memory_region));
+
+ for (i = 0; i < RTE_MAX_MEMSEG; ++i) {
+ seg = &rte_eal_get_configuration()->mem_config->memseg[i];
+ if (!seg->addr)
+ break;
+
+ int new_region = 1;
+
+ for (j = 0; j < k; ++j) {
+ mr = &vm->regions[j];
+
+ if (mr->userspace_addr + mr->memory_size ==
+ (uint64_t)seg->addr) {
+ mr->memory_size += seg->len;
+ new_region = 0;
+ break;
+ }
+
+ if ((uint64_t)seg->addr + seg->len ==
+ mr->userspace_addr) {
+ mr->guest_phys_addr = (uint64_t)seg->addr;
+ mr->userspace_addr = (uint64_t)seg->addr;
+ mr->memory_size += seg->len;
+ new_region = 0;
+ break;
+ }
+ }
+
+ if (new_region == 0)
+ continue;
+
+ mr = &vm->regions[k++];
+ mr->guest_phys_addr = (uint64_t)seg->addr; /* use vaddr here! */
+ mr->userspace_addr = (uint64_t)seg->addr;
+ mr->memory_size = seg->len;
+ mr->mmap_offset = 0;
+ }
+
+ vm->nregions = k;
+ vm->padding = 0;
+ *p_vm = vm;
+}
+
+static void kick_all_vq(struct virtio_hw *hw)
+{
+ uint64_t features;
+ unsigned i, queue_sel, nvqs;
+ struct rte_eth_dev_data *data = hw->data;
+
+ if (hw->type == VHOST_KERNEL) {
+ struct vhost_memory_kernel *vm = NULL;
+
+ prepare_vhost_memory_kernel(&vm);
+ vhost_call(hw, VHOST_MSG_SET_MEM_TABLE, vm);
+ free(vm);
+ } else {
+ /* construct vhost_memory inside prepare_vhost_memory_user() */
+ vhost_call(hw, VHOST_MSG_SET_MEM_TABLE, NULL);
+ }
+
+ for (i = 0; i < data->nb_rx_queues; ++i) {
+ queue_sel = 2 * i + VTNET_SQ_RQ_QUEUE_IDX;
+ kick_one_vq(hw, data->rx_queues[i], queue_sel);
+ }
+ for (i = 0; i < data->nb_tx_queues; ++i) {
+ queue_sel = 2 * i + VTNET_SQ_TQ_QUEUE_IDX;
+ kick_one_vq(hw, data->tx_queues[i], queue_sel);
+ }
+
+ /* After setting up all virtqueues, we need to set features again
+ * so that these features can be applied to each virtqueue on the
+ * vhost side.
+ */
+ features = hw->guest_features;
+ features &= ~(1ull << VIRTIO_NET_F_MAC);
+ vhost_call(hw, VHOST_MSG_SET_FEATURES, &features);
+ if (hw->type == VHOST_KERNEL)
+ if (ioctl(hw->backfd, TUNSETVNETHDRSZ,
+ &hw->vtnet_hdr_size) == -1)
+ rte_panic("TUNSETVNETHDRSZ failed: %s\n",
+ strerror(errno));
+ PMD_DRV_LOG(INFO, "set features:%" PRIx64 "\n", features);
+
+ if (hw->type == VHOST_KERNEL) {
+ struct vhost_vring_file file;
+
+ file.fd = hw->backfd;
+ nvqs = data->nb_rx_queues + data->nb_tx_queues;
+ for (file.index = 0; file.index < nvqs; ++file.index) {
+ if (vhost_kernel_ioctl(hw, VHOST_NET_SET_BACKEND,
+ &file) < 0)
+ rte_panic("VHOST_NET_SET_BACKEND failed, %s\n",
+ strerror(errno));
+ }
+ }
+}
+
+static void
+vdev_read_dev_config(struct virtio_hw *hw, uint64_t offset,
+ void *dst, int length)
+{
+ if (offset == offsetof(struct virtio_net_config, mac) &&
+ length == ETHER_ADDR_LEN) {
+ int i;
+
+ for (i = 0; i < ETHER_ADDR_LEN; ++i)
+ ((uint8_t *)dst)[i] = hw->mac_addr[i];
+ return;
+ }
+
+ if (offset == offsetof(struct virtio_net_config, status))
+ *(uint16_t *)dst = hw->status;
+
+ if (offset == offsetof(struct virtio_net_config, max_virtqueue_pairs))
+ *(uint16_t *)dst = hw->max_tx_queues;
+}
+
+static void
+vdev_write_dev_config(struct virtio_hw *hw, uint64_t offset,
+ const void *src, int length)
+{
+ int i;
+
+ if ((offset == offsetof(struct virtio_net_config, mac)) &&
+ (length == ETHER_ADDR_LEN))
+ for (i = 0; i < ETHER_ADDR_LEN; ++i)
+ hw->mac_addr[i] = ((const uint8_t *)src)[i];
+ else
+ rte_panic("offset=%" PRIu64 ", length=%d\n", offset, length);
+}
+
+static void
+vdev_set_status(struct virtio_hw *hw, uint8_t status)
+{
+ if (status & VIRTIO_CONFIG_S_DRIVER_OK)
+ kick_all_vq(hw);
+ hw->status = status;
+}
+
+static void
+vdev_reset(struct virtio_hw *hw __rte_unused)
+{
+ /* do nothing according to qemu vhost user spec */
+}
+
+static uint8_t
+vdev_get_status(struct virtio_hw *hw)
+{
+ return hw->status;
+}
+
+static uint64_t
+vdev_get_features(struct virtio_hw *hw)
+{
+ uint64_t host_features;
+
+ vhost_call(hw, VHOST_MSG_GET_FEATURES, &host_features);
+ if (hw->mac_specified)
+ host_features |= (1ull << VIRTIO_NET_F_MAC);
+ /* disable it until we support CQ */
+ host_features &= ~(1ull << VIRTIO_NET_F_CTRL_VQ);
+ host_features &= ~(1ull << VIRTIO_NET_F_CTRL_RX);
+ return host_features;
+}
+
+static void
+vdev_set_features(struct virtio_hw *hw, uint64_t features)
+{
+ features &= ~(1ull << VIRTIO_NET_F_MAC);
+ vhost_call(hw, VHOST_MSG_SET_FEATURES, &features);
+}
+
+static uint8_t
+vdev_get_isr(struct virtio_hw *hw __rte_unused)
+{
+ rte_panic("");
+}
+
+static uint16_t
+vdev_set_config_irq(struct virtio_hw *hw __rte_unused,
+ uint16_t vec __rte_unused)
+{
+ rte_panic("");
+}
+
+static uint16_t
+vdev_get_queue_num(struct virtio_hw *hw,
+ uint16_t queue_id __rte_unused)
+{
+ return hw->queue_num;
+}
+
+static void
+vdev_setup_queue(struct virtio_hw *hw __rte_unused,
+ struct virtqueue *vq __rte_unused)
+{
+ /* do nothing */
+}
+
+static void
+vdev_del_queue(struct virtio_hw *hw __rte_unused,
+ struct virtqueue *vq)
+{
+ struct vhost_vring_state state = {
+ .index = vq->vq_queue_index,
+ };
+
+ vhost_call(hw, VHOST_MSG_GET_VRING_BASE, &state);
+ PMD_DRV_LOG(DEBUG, "state.num = %d\n", state.num);
+}
+
+static void
+vdev_notify_queue(struct virtio_hw *hw, struct virtqueue *vq)
+{
+ uint64_t buf = 1;
+
+ if (write(hw->kickfds[vq->vq_queue_index],
+ &buf, sizeof(uint64_t)) == -1)
+ rte_panic("%s\n", strerror(errno));
+}
+
+static const struct virtio_pci_ops vdev_ops = {
+ .read_dev_cfg = vdev_read_dev_config,
+ .write_dev_cfg = vdev_write_dev_config,
+ .reset = vdev_reset,
+ .get_status = vdev_get_status,
+ .set_status = vdev_set_status,
+ .get_features = vdev_get_features,
+ .set_features = vdev_set_features,
+ .get_isr = vdev_get_isr,
+ .set_config_irq = vdev_set_config_irq,
+ .get_queue_num = vdev_get_queue_num,
+ .setup_queue = vdev_setup_queue,
+ .del_queue = vdev_del_queue,
+ .notify_queue = vdev_notify_queue,
+};
+
+#define TUN_DEF_SNDBUF (1ull << 20)
+
+static void
+vhost_kernel_backend_setup(struct virtio_hw *hw, char *ifname)
+{
+ int fd;
+ int len = sizeof(struct virtio_net_hdr);
+ int req_mq = 0;
+ int sndbuf = TUN_DEF_SNDBUF;
+ unsigned int features;
+ struct ifreq ifr;
+
+ /* TODO:
+ * 1. get/set offload capability, tap_probe_has_ufo, tap_fd_set_offload
+ * 2. verify we can get/set vnet_hdr_len, tap_probe_vnet_hdr_len
+ * 3. get number of memory regions from vhost module parameter
+ * max_mem_regions, supported in newer Linux kernels
+ */
+
+ fd = open(PATH_NET_TUN, O_RDWR);
+ if (fd < 0)
+ rte_panic("open %s error, %s\n", PATH_NET_TUN, strerror(errno));
+
+ memset(&ifr, 0, sizeof(ifr));
+ ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
+
+ if (ioctl(fd, TUNGETFEATURES, &features) == -1)
+ rte_panic("TUNGETFEATURES failed: %s", strerror(errno));
+
+ if (features & IFF_ONE_QUEUE)
+ ifr.ifr_flags |= IFF_ONE_QUEUE;
+
+ if (features & IFF_VNET_HDR)
+ ifr.ifr_flags |= IFF_VNET_HDR;
+ else
+ rte_panic("vnet_hdr requested, but kernel does not support\n");
+
+ if (req_mq) {
+ if (features & IFF_MULTI_QUEUE)
+ ifr.ifr_flags |= IFF_MULTI_QUEUE;
+ else
+ rte_panic("multiqueue requested, but kernel does not support\n");
+ }
+
+ if (ifname)
+ strncpy(ifr.ifr_name, ifname, IFNAMSIZ);
+ else
+ strncpy(ifr.ifr_name, "tap%d", IFNAMSIZ);
+ if (ioctl(fd, TUNSETIFF, (void *)&ifr) == -1)
+ rte_panic("TUNSETIFF failed: %s", strerror(errno));
+ fcntl(fd, F_SETFL, O_NONBLOCK);
+
+ if (ioctl(fd, TUNSETVNETHDRSZ, &len) == -1)
+ rte_panic("TUNSETVNETHDRSZ failed: %s\n", strerror(errno));
+
+ if (ioctl(fd, TUNSETSNDBUF, &sndbuf) == -1)
+ rte_panic("TUNSETSNDBUF failed: %s", strerror(errno));
+
+ hw->backfd = fd;
+ hw->vhostfd = open(hw->path, O_RDWR);
+ if (hw->vhostfd < 0)
+ rte_panic("open %s failed: %s\n", hw->path, strerror(errno));
+}
+
+static void
+vhost_user_backend_setup(struct virtio_hw *hw)
+{
+ int fd;
+ int flag;
+ struct sockaddr_un un;
+
+ fd = socket(AF_UNIX, SOCK_STREAM, 0);
+ if (fd < 0)
+ rte_panic("socket error, %s\n", strerror(errno));
+
+ flag = fcntl(fd, F_GETFD);
+ fcntl(fd, F_SETFD, flag | FD_CLOEXEC);
+
+ memset(&un, 0, sizeof(un));
+ un.sun_family = AF_UNIX;
+ snprintf(un.sun_path, sizeof(un.sun_path), "%s", hw->path);
+ if (connect(fd, (struct sockaddr *)&un, sizeof(un)) < 0) {
+ PMD_DRV_LOG(ERR, "connect error, %s\n", strerror(errno));
+ rte_panic("connect error, %s\n", strerror(errno));
+ }
+
+ hw->vhostfd = fd;
+}
+
+void
+virtio_vdev_init(struct rte_eth_dev_data *data, char *path,
+ int nb_rx, int nb_tx, int nb_cq __attribute__ ((unused)),
+ int queue_num, char *mac, char *ifname)
+{
+ int i, r;
+ struct stat s;
+ uint32_t tmp[ETHER_ADDR_LEN];
+ struct virtio_hw *hw = data->dev_private;
+
+ hw->vtpci_ops = &vdev_ops;
+ hw->io_base = 0;
+ hw->use_msix = 0;
+ hw->modern = 0;
+
+ hw->data = data;
+ hw->path = strdup(path);
+ hw->max_rx_queues = nb_rx;
+ hw->max_tx_queues = nb_tx;
+ hw->queue_num = queue_num;
+ hw->mac_specified = 0;
+ if (mac) {
+ r = sscanf(mac, "%x:%x:%x:%x:%x:%x", &tmp[0],
+ &tmp[1], &tmp[2], &tmp[3], &tmp[4], &tmp[5]);
+ if (r == ETHER_ADDR_LEN) {
+ for (i = 0; i < ETHER_ADDR_LEN; ++i)
+ hw->mac_addr[i] = (uint8_t)tmp[i];
+ hw->mac_specified = 1;
+ } else
+ PMD_DRV_LOG(WARN, "wrong format of mac: %s", mac);
+ }
+
+ /* TODO: cq */
+
+ if (stat(hw->path, &s) < 0)
+ rte_panic("stat: %s failed, %s\n", hw->path, strerror(errno));
+
+ switch (s.st_mode & S_IFMT) {
+ case S_IFCHR:
+ hw->type = VHOST_KERNEL;
+ vhost_kernel_backend_setup(hw, ifname);
+ break;
+ case S_IFSOCK:
+ hw->type = VHOST_USER;
+ vhost_user_backend_setup(hw);
+ break;
+ default:
+ rte_panic("unknown file type of %s\n", hw->path);
+ }
+ if (vhost_call(hw, VHOST_MSG_SET_OWNER, NULL) == -1)
+ rte_panic("vhost set_owner failed: %s\n", strerror(errno));
+}
diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index fed9571..fde77ca 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -123,5 +123,9 @@ uint16_t virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
#define VTNET_LRO_FEATURES (VIRTIO_NET_F_GUEST_TSO4 | \
VIRTIO_NET_F_GUEST_TSO6 | VIRTIO_NET_F_GUEST_ECN)

-
+#ifdef RTE_VIRTIO_VDEV
+void virtio_vdev_init(struct rte_eth_dev_data *data, char *path, int nb_rx,
+ int nb_tx, int nb_cq, int queue_num, char *mac,
+ char *ifname);
+#endif
#endif /* _VIRTIO_ETHDEV_H_ */
diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
index 0544a07..a8394f8 100644
--- a/drivers/net/virtio/virtio_pci.h
+++ b/drivers/net/virtio/virtio_pci.h
@@ -150,7 +150,6 @@ struct virtqueue;
* rest are per-device feature bits.
*/
#define VIRTIO_TRANSPORT_F_START 28
-#define VIRTIO_TRANSPORT_F_END 32

/* The Guest publishes the used index for which it expects an interrupt
* at the end of the avail ring. Host should ignore the avail->flags field. */
@@ -266,6 +265,20 @@ struct virtio_hw {
struct virtio_pci_common_cfg *common_cfg;
struct virtio_net_config *dev_cfg;
const struct virtio_pci_ops *vtpci_ops;
+#ifdef RTE_VIRTIO_VDEV
+#define VHOST_KERNEL 0
+#define VHOST_USER 1
+ int type; /* type of backend */
+ uint32_t queue_num;
+ char *path;
+ int mac_specified;
+ int vhostfd;
+ int backfd; /* tap device used in vhost-net */
+ int callfds[VIRTIO_MAX_VIRTQUEUES * 2 + 1];
+ int kickfds[VIRTIO_MAX_VIRTQUEUES * 2 + 1];
+ uint8_t status;
+ struct rte_eth_dev_data *data;
+#endif
};

/*
--
2.1.4
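
For reference, the kick/call notification set up in kick_one_vq() and
used by vdev_notify_queue() above is built on plain eventfds: the
driver writes an 8-byte counter to kickfd, and the backend reads it to
learn that the queue was kicked. Below is a minimal standalone sketch
of that mechanism, not part of the patch; note that the patch passes
O_CLOEXEC | O_NONBLOCK to eventfd(), which on Linux happens to have the
same values as the intended EFD_CLOEXEC | EFD_NONBLOCK flags.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
	uint64_t buf = 1, got = 0;
	int kickfd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);

	if (kickfd < 0)
		return 1;
	/* driver side: kick the queue (what vdev_notify_queue() does) */
	if (write(kickfd, &buf, sizeof(buf)) != sizeof(buf))
		return 1;
	/* backend side: reading drains the accumulated kick count */
	if (read(kickfd, &got, sizeof(got)) != sizeof(got))
		return 1;
	printf("kicks drained: %" PRIu64 "\n", got);
	close(kickfd);
	return 0;
}
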
Michael S. Tsirkin
2016-02-07 10:47:44 UTC
Permalink
Post by Jianfeng Tan
diff --git a/drivers/net/virtio/vhost.h b/drivers/net/virtio/vhost.h
new file mode 100644
index 0000000..73d4f5c
--- /dev/null
+++ b/drivers/net/virtio/vhost.h
@@ -0,0 +1,194 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _VHOST_NET_USER_H
+#define _VHOST_NET_USER_H
+
+#include <stdint.h>
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define VHOST_MEMORY_MAX_NREGIONS 8
Don't hard-code this, it's not nice.
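The patch's own TODO list already points at one way out for the
vhost-kernel case: newer kernels expose the vhost module parameter
max_mem_regions. A hypothetical sketch of querying it at runtime,
falling back to the old fixed limit of 8 when the parameter is absent
(the macro names here are placeholders, and the vhost-user side would
still need its own protocol-level limit):

#include <stdio.h>

#define NREGIONS_DEFAULT 8
#define NREGIONS_PARAM "/sys/module/vhost/parameters/max_mem_regions"

static int
vhost_max_mem_regions(void)
{
	unsigned int val;
	FILE *f = fopen(NREGIONS_PARAM, "r");

	if (f == NULL)
		return NREGIONS_DEFAULT; /* older kernel: keep 8 */
	if (fscanf(f, "%u", &val) != 1)
		val = NREGIONS_DEFAULT;
	fclose(f);
	return (int)val;
}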
Post by Jianfeng Tan
+
+struct vhost_vring_state {
+ unsigned int index;
+ unsigned int num;
+};
+
+struct vhost_vring_file {
+ unsigned int index;
+ int fd;
+};
+
+struct vhost_vring_addr {
+ unsigned int index;
+ /* Option flags. */
+ unsigned int flags;
+ /* Flag values: */
+ /* Whether log address is valid. If set enables logging. */
+#define VHOST_VRING_F_LOG 0
+
+ /* Start of array of descriptors (virtually contiguous) */
+ uint64_t desc_user_addr;
+ /* Used structure address. Must be 32 bit aligned */
+ uint64_t used_user_addr;
+ /* Available structure address. Must be 16 bit aligned */
+ uint64_t avail_user_addr;
+ /* Logging support. */
+ /* Log writes to used structure, at offset calculated from specified
+ * address. Address must be 32 bit aligned.
+ */
+ uint64_t log_guest_addr;
+};
+
+#define VIRTIO_CONFIG_S_DRIVER_OK 4
+
+enum vhost_user_request {
+ VHOST_USER_NONE = 0,
+ VHOST_USER_GET_FEATURES = 1,
+ VHOST_USER_SET_FEATURES = 2,
+ VHOST_USER_SET_OWNER = 3,
+ VHOST_USER_RESET_OWNER = 4,
+ VHOST_USER_SET_MEM_TABLE = 5,
+ VHOST_USER_SET_LOG_BASE = 6,
+ VHOST_USER_SET_LOG_FD = 7,
+ VHOST_USER_SET_VRING_NUM = 8,
+ VHOST_USER_SET_VRING_ADDR = 9,
+ VHOST_USER_SET_VRING_BASE = 10,
+ VHOST_USER_GET_VRING_BASE = 11,
+ VHOST_USER_SET_VRING_KICK = 12,
+ VHOST_USER_SET_VRING_CALL = 13,
+ VHOST_USER_SET_VRING_ERR = 14,
+ VHOST_USER_GET_PROTOCOL_FEATURES = 15,
+ VHOST_USER_SET_PROTOCOL_FEATURES = 16,
+ VHOST_USER_GET_QUEUE_NUM = 17,
+ VHOST_USER_SET_VRING_ENABLE = 18,
+ VHOST_USER_MAX
+};
+
+struct vhost_memory_region {
+ uint64_t guest_phys_addr;
+ uint64_t memory_size; /* bytes */
+ uint64_t userspace_addr;
+ uint64_t mmap_offset;
+};
+
+struct vhost_memory_kernel {
+ uint32_t nregions;
+ uint32_t padding;
+ struct vhost_memory_region regions[0];
+};
+
+struct vhost_memory {
+ uint32_t nregions;
+ uint32_t padding;
+ struct vhost_memory_region regions[VHOST_MEMORY_MAX_NREGIONS];
+};
+
+struct vhost_user_msg {
+ enum vhost_user_request request;
+
+#define VHOST_USER_VERSION_MASK 0x3
+#define VHOST_USER_REPLY_MASK (0x1 << 2)
+ uint32_t flags;
+ uint32_t size; /* the following payload size */
+ union {
+#define VHOST_USER_VRING_IDX_MASK 0xff
+#define VHOST_USER_VRING_NOFD_MASK (0x1 << 8)
+ uint64_t u64;
+ struct vhost_vring_state state;
+ struct vhost_vring_addr addr;
+ struct vhost_memory memory;
+ } payload;
+ int fds[VHOST_MEMORY_MAX_NREGIONS];
+} __attribute((packed));
+
+#define VHOST_USER_HDR_SIZE offsetof(struct vhost_user_msg, payload.u64)
+#define VHOST_USER_PAYLOAD_SIZE (sizeof(struct vhost_user_msg) - VHOST_USER_HDR_SIZE)
+
+/* The version of the protocol we support */
+#define VHOST_USER_VERSION 0x1
+
+/* ioctls */
Why do you duplicate ioctls?
Use them from /usr/include/linux/vhost.h, etc.

In fact, what's not coming from linux here
comes from lib/librte_vhost/vhost_user/vhost-net-user.h.

I think you should reuse code, avoid code duplication.
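As an illustration of this point (not code from the patch), the ioctl
numbers and tun/tap definitions below could, on Linux builds, be taken
straight from the kernel headers, keeping local copies only for what no
system header provides (such as the vhost-user message layout). A
sketch, assuming the vhost-kernel path is only ever compiled on Linux:

#ifdef __linux__
#include <linux/vhost.h>	/* VHOST_GET_FEATURES, VHOST_SET_OWNER, ... */
#include <linux/if_tun.h>	/* TUNSETIFF, IFF_TAP, TUN_F_CSUM, ... */
#else
/* non-Linux builds: fall back to local definitions as below */
#endif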
Post by Jianfeng Tan
+
+#define VHOST_VIRTIO 0xAF
+
+#define VHOST_GET_FEATURES _IOR(VHOST_VIRTIO, 0x00, __u64)
+#define VHOST_SET_FEATURES _IOW(VHOST_VIRTIO, 0x00, __u64)
+#define VHOST_SET_OWNER _IO(VHOST_VIRTIO, 0x01)
+#define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02)
+#define VHOST_SET_MEM_TABLE _IOW(VHOST_VIRTIO, 0x03, struct vhost_memory_kernel)
+#define VHOST_SET_LOG_BASE _IOW(VHOST_VIRTIO, 0x04, __u64)
+#define VHOST_SET_LOG_FD _IOW(VHOST_VIRTIO, 0x07, int)
+#define VHOST_SET_VRING_NUM _IOW(VHOST_VIRTIO, 0x10, struct vhost_vring_state)
+#define VHOST_SET_VRING_ADDR _IOW(VHOST_VIRTIO, 0x11, struct vhost_vring_addr)
+#define VHOST_SET_VRING_BASE _IOW(VHOST_VIRTIO, 0x12, struct vhost_vring_state)
+#define VHOST_GET_VRING_BASE _IOWR(VHOST_VIRTIO, 0x12, struct vhost_vring_state)
+#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_vring_file)
+#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_vring_file)
+#define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
+#define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file)
+
+/*****************************************************************************/
+
+/* Ioctl defines */
+#define TUNSETIFF _IOW('T', 202, int)
+#define TUNGETFEATURES _IOR('T', 207, unsigned int)
+#define TUNSETOFFLOAD _IOW('T', 208, unsigned int)
+#define TUNGETIFF _IOR('T', 210, unsigned int)
+#define TUNSETSNDBUF _IOW('T', 212, int)
+#define TUNGETVNETHDRSZ _IOR('T', 215, int)
+#define TUNSETVNETHDRSZ _IOW('T', 216, int)
+#define TUNSETQUEUE _IOW('T', 217, int)
+#define TUNSETVNETLE _IOW('T', 220, int)
+#define TUNSETVNETBE _IOW('T', 222, int)
+
+/* TUNSETIFF ifr flags */
+#define IFF_TAP 0x0002
+#define IFF_NO_PI 0x1000
+#define IFF_ONE_QUEUE 0x2000
+#define IFF_VNET_HDR 0x4000
+#define IFF_MULTI_QUEUE 0x0100
+#define IFF_ATTACH_QUEUE 0x0200
+#define IFF_DETACH_QUEUE 0x0400
+
+/* Features for GSO (TUNSETOFFLOAD). */
+#define TUN_F_CSUM 0x01 /* You can hand me unchecksummed packets. */
+#define TUN_F_TSO4 0x02 /* I can handle TSO for IPv4 packets */
+#define TUN_F_TSO6 0x04 /* I can handle TSO for IPv6 packets */
+#define TUN_F_TSO_ECN 0x08 /* I can handle TSO with ECN bits. */
+#define TUN_F_UFO 0x10 /* I can handle UFO packets */
+
+#define PATH_NET_TUN "/dev/net/tun"
+
+#endif
Tetsuya Mukawa
2016-02-08 06:59:38 UTC
Permalink
Post by Jianfeng Tan
To implement virtio vdev, we need a way to interact with the vhost
backend. And more importantly, we need a way to emulate a device inside
DPDK. So this patch acts as the embedded device emulation.
The backend depends on the type of the vhost file: vhost-user is used
if the given path points to a unix socket; vhost-net is used if the
given path points to a char device.
---
+void
+virtio_vdev_init(struct rte_eth_dev_data *data, char *path,
+ int nb_rx, int nb_tx, int nb_cq __attribute__ ((unused)),
+ int queue_num, char *mac, char *ifname)
+{
+ int i, r;
+ struct stat s;
+ uint32_t tmp[ETHER_ADDR_LEN];
+ struct virtio_hw *hw = data->dev_private;
+
+ hw->vtpci_ops = &vdev_ops;
+ hw->io_base = 0;
+ hw->use_msix = 0;
+ hw->modern = 0;
+
+ hw->data = data;
+ hw->path = strdup(path);
+ hw->max_rx_queues = nb_rx;
+ hw->max_tx_queues = nb_tx;
+ hw->queue_num = queue_num;
+ hw->mac_specified = 0;
+ if (mac) {
+ r = sscanf(mac, "%x:%x:%x:%x:%x:%x", &tmp[0],
+ &tmp[1], &tmp[2], &tmp[3], &tmp[4], &tmp[5]);
+ if (r == ETHER_ADDR_LEN) {
+ for (i = 0; i < ETHER_ADDR_LEN; ++i)
+ hw->mac_addr[i] = (uint8_t)tmp[i];
+ hw->mac_specified = 1;
+ } else
+ PMD_DRV_LOG(WARN, "wrong format of mac: %s", mac);
It seems you cannot use 'WARN' here.

Thanks,
Tetsuya
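
For context on this comment: a sketch, assuming PMD_DRV_LOG() forwards
its first argument to RTE_LOG(), which pastes it onto the RTE_LOG_
level prefix, so only the existing level suffixes compile:

	RTE_LOG(ERR, PMD, "...");	/* RTE_LOG_ERR     - exists    */
	RTE_LOG(WARNING, PMD, "...");	/* RTE_LOG_WARNING - exists    */
	RTE_LOG(WARN, PMD, "...");	/* RTE_LOG_WARN    - undefined */

so WARNING (or ERR, as chosen in the reply below) would compile, while
WARN does not.
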
Tan, Jianfeng
2016-02-16 02:47:37 UTC
Permalink
Hi Tetsuya,
Post by Tetsuya Mukawa
Post by Jianfeng Tan
To implement virtio vdev, we need a way to interact with the vhost
backend. And more importantly, we need a way to emulate a device inside
DPDK. So this patch acts as the embedded device emulation.
The backend depends on the type of the vhost file: vhost-user is used
if the given path points to a unix socket; vhost-net is used if the
given path points to a char device.
---
+void
+virtio_vdev_init(struct rte_eth_dev_data *data, char *path,
+ int nb_rx, int nb_tx, int nb_cq __attribute__ ((unused)),
+ int queue_num, char *mac, char *ifname)
+{
+ int i, r;
+ struct stat s;
+ uint32_t tmp[ETHER_ADDR_LEN];
+ struct virtio_hw *hw = data->dev_private;
+
+ hw->vtpci_ops = &vdev_ops;
+ hw->io_base = 0;
+ hw->use_msix = 0;
+ hw->modern = 0;
+
+ hw->data = data;
+ hw->path = strdup(path);
+ hw->max_rx_queues = nb_rx;
+ hw->max_tx_queues = nb_tx;
+ hw->queue_num = queue_num;
+ hw->mac_specified = 0;
+ if (mac) {
+ r = sscanf(mac, "%x:%x:%x:%x:%x:%x", &tmp[0],
+ &tmp[1], &tmp[2], &tmp[3], &tmp[4], &tmp[5]);
+ if (r == ETHER_ADDR_LEN) {
+ for (i = 0; i < ETHER_ADDR_LEN; ++i)
+ hw->mac_addr[i] = (uint8_t)tmp[i];
+ hw->mac_specified = 1;
+ } else
+ PMD_DRV_LOG(WARN, "wrong format of mac: %s", mac);
It seems you cannot use 'WARN' here.
Thanks, I'll change it to ERR.

Thanks,
Jianfeng
Post by Tetsuya Mukawa
Thanks,
Tetsuya
Tan, Jianfeng
2016-02-16 02:40:46 UTC
Permalink
Hi Michael,

I don't know why, but I have not received the email in which you
commented on this commit.
Post by Jianfeng Tan
To implement virtio vdev, we need a way to interact with the vhost
backend. And more importantly, we need a way to emulate a device inside
DPDK. So this patch acts as the embedded device emulation.
The backend depends on the type of the vhost file: vhost-user is used
if the given path points to a unix socket; vhost-net is used if the
given path points to a char device.
---
config/common_linuxapp | 5 +
drivers/net/virtio/Makefile | 4 +
drivers/net/virtio/vhost.h | 194 +++++++++
drivers/net/virtio/vhost_embedded.c | 809 ++++++++++++++++++++++++++++++++++++
drivers/net/virtio/virtio_ethdev.h | 6 +-
drivers/net/virtio/virtio_pci.h | 15 +-
6 files changed, 1031 insertions(+), 2 deletions(-)
create mode 100644 drivers/net/virtio/vhost.h
create mode 100644 drivers/net/virtio/vhost_embedded.c
...
Post by Michael S. Tsirkin
Don't hard-code this, it's not nice.
Actually, it comes from lib/librte_vhost/rte_virtio_net.h. If we
follow your suggestion below, it'll be addressed.
Post by Michael S. Tsirkin
Why do you duplicate ioctls?
Use them from /usr/include/linux/vhost.h, etc.
In fact, what's not coming from linux here
comes from lib/librte_vhost/vhost_user/vhost-net-user.h.
I think you should reuse code, avoid code duplication.
The reasons I was considering are:
a. If we include /usr/include/linux/vhost.h, then virtio cannot be used
on FreeBSD.
b. To use the definitions in lib/librte_vhost/vhost_user/vhost-net-user.h,
we would need to expose this header file outside the library.

Thanks,
Jianfeng
Jianfeng Tan
2016-02-05 11:20:27 UTC
Permalink
Add a new virtual device named eth_cvio; it can be used just like
eth_ring, eth_null, etc.

Configured parameters include:
- rx (optional, 1 by default), number of rx queues; not used for now.
- tx (optional, 1 by default), number of tx queues; not used for now.
- cq (optional, 0 by default), whether CQ is enabled; not supported
for now.
- mac (optional), a random value will be used if not specified.
- queue_num (optional, 256 by default), size of the virtqueue.
- path (mandatory), path of the vhost backend; vhost-user is used
if the given path points to a unix socket, vhost-net if the given
path points to a char device.
- ifname (optional), the name of the backend tap device; only valid
when the backend is vhost-net.

The major difference from the original virtio for VMs is that here we
use virtual addresses instead of physical addresses for vhost to
calculate relative addresses, as sketched below.
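
A minimal sketch of the idea on a 64-bit build (struct mbuf_like is a
hypothetical stand-in for struct rte_mbuf; the patch itself wires this
up through vq->offset and offsetof() further down):

#include <stddef.h>
#include <stdint.h>

struct mbuf_like {		/* stand-in for struct rte_mbuf */
	void *buf_addr;		/* virtual address  -> used by eth_cvio   */
	uint64_t buf_physaddr;	/* physical address -> used by PCI virtio */
};

static inline uint64_t
desc_addr(const struct mbuf_like *mb, size_t offset)
{
	/* read whichever field the per-queue offset selects */
	return *(const uint64_t *)((const char *)mb + offset);
}

/* PCI device: desc_addr(mb, offsetof(struct mbuf_like, buf_physaddr))
 * eth_cvio:   desc_addr(mb, offsetof(struct mbuf_like, buf_addr))    */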

When CONFIG_RTE_VIRTIO_VDEV is enabled (the default), the compiled
library can be used in both VM and container environments.

Examples:
a. Use vhost-net as a backend
sudo numactl -N 1 -m 1 ./examples/l2fwd/build/l2fwd -c 0x100000 -n 4 \
-m 1024 --no-pci --single-file --file-prefix=l2fwd \
--vdev=eth_cvio0,mac=00:01:02:03:04:05,path=/dev/vhost-net \
-- -p 0x1

b. Use vhost-user as a backend
numactl -N 1 -m 1 ./examples/l2fwd/build/l2fwd -c 0x100000 -n 4 -m 1024 \
--no-pci --single-file --file-prefix=l2fwd \
--vdev=eth_cvio0,mac=00:01:02:03:04:05,path=<path_to_vhost_user> \
-- -p 0x1

Signed-off-by: Huawei Xie <***@intel.com>
Signed-off-by: Jianfeng Tan <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.c | 329 +++++++++++++++++++++++++-------
drivers/net/virtio/virtio_rxtx.c | 6 +-
drivers/net/virtio/virtio_rxtx_simple.c | 13 +-
drivers/net/virtio/virtqueue.h | 15 +-
4 files changed, 282 insertions(+), 81 deletions(-)

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 755503d..b790fd0 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -52,6 +52,7 @@
#include <rte_memory.h>
#include <rte_eal.h>
#include <rte_dev.h>
+#include <rte_kvargs.h>

#include "virtio_ethdev.h"
#include "virtio_pci.h"
@@ -170,14 +171,14 @@ virtio_send_command(struct virtqueue *vq, struct virtio_pmd_ctrl *ctrl,
* One RX packet for ACK.
*/
vq->vq_ring.desc[head].flags = VRING_DESC_F_NEXT;
- vq->vq_ring.desc[head].addr = vq->virtio_net_hdr_mz->phys_addr;
+ vq->vq_ring.desc[head].addr = vq->virtio_net_hdr_mem;
vq->vq_ring.desc[head].len = sizeof(struct virtio_net_ctrl_hdr);
vq->vq_free_cnt--;
i = vq->vq_ring.desc[head].next;

for (k = 0; k < pkt_num; k++) {
vq->vq_ring.desc[i].flags = VRING_DESC_F_NEXT;
- vq->vq_ring.desc[i].addr = vq->virtio_net_hdr_mz->phys_addr
+ vq->vq_ring.desc[i].addr = vq->virtio_net_hdr_mem
+ sizeof(struct virtio_net_ctrl_hdr)
+ sizeof(ctrl->status) + sizeof(uint8_t)*sum;
vq->vq_ring.desc[i].len = dlen[k];
@@ -187,7 +188,7 @@ virtio_send_command(struct virtqueue *vq, struct virtio_pmd_ctrl *ctrl,
}

vq->vq_ring.desc[i].flags = VRING_DESC_F_WRITE;
- vq->vq_ring.desc[i].addr = vq->virtio_net_hdr_mz->phys_addr
+ vq->vq_ring.desc[i].addr = vq->virtio_net_hdr_mem
+ sizeof(struct virtio_net_ctrl_hdr);
vq->vq_ring.desc[i].len = sizeof(ctrl->status);
vq->vq_free_cnt--;
@@ -366,70 +367,85 @@ int virtio_dev_queue_setup(struct rte_eth_dev *dev,
}
}

- /*
- * Virtio PCI device VIRTIO_PCI_QUEUE_PF register is 32bit,
- * and only accepts 32 bit page frame number.
- * Check if the allocated physical memory exceeds 16TB.
- */
- if ((mz->phys_addr + vq->vq_ring_size - 1) >> (VIRTIO_PCI_QUEUE_ADDR_SHIFT + 32)) {
- PMD_INIT_LOG(ERR, "vring address shouldn't be above 16TB!");
- rte_free(vq);
- return -ENOMEM;
- }
-
memset(mz->addr, 0, sizeof(mz->len));
vq->mz = mz;
- vq->vq_ring_mem = mz->phys_addr;
vq->vq_ring_virt_mem = mz->addr;
- PMD_INIT_LOG(DEBUG, "vq->vq_ring_mem: 0x%"PRIx64, (uint64_t)mz->phys_addr);
- PMD_INIT_LOG(DEBUG, "vq->vq_ring_virt_mem: 0x%"PRIx64, (uint64_t)(uintptr_t)mz->addr);
+
+ if (dev->dev_type == RTE_ETH_DEV_PCI) {
+ vq->vq_ring_mem = mz->phys_addr;
+
+ /* Virtio PCI device VIRTIO_PCI_QUEUE_PF register is 32bit,
+ * and only accepts 32 bit page frame number.
+ * Check if the allocated physical memory exceeds 16TB.
+ */
+ uint64_t last_physaddr = vq->vq_ring_mem + vq->vq_ring_size - 1;
+
+ if (last_physaddr >> (VIRTIO_PCI_QUEUE_ADDR_SHIFT + 32)) {
+ PMD_INIT_LOG(ERR,
+ "vring address shouldn't be above 16TB!");
+ rte_free(vq);
+ return -ENOMEM;
+ }
+ }
+#ifdef RTE_VIRTIO_VDEV
+ else
+ vq->vq_ring_mem = (phys_addr_t)mz->addr; /* Use vaddr!!! */
+#endif
+
+ PMD_INIT_LOG(DEBUG, "vq->vq_ring_mem: 0x%" PRIx64,
+ (uint64_t)vq->vq_ring_mem);
+ PMD_INIT_LOG(DEBUG, "vq->vq_ring_virt_mem: 0x%" PRIx64,
+ (uint64_t)(uintptr_t)vq->vq_ring_virt_mem);
vq->virtio_net_hdr_mz = NULL;
vq->virtio_net_hdr_mem = 0;

+ uint64_t hdr_size = 0;
if (queue_type == VTNET_TQ) {
/*
* For each xmit packet, allocate a virtio_net_hdr
*/
snprintf(vq_name, sizeof(vq_name), "port%d_tvq%d_hdrzone",
dev->data->port_id, queue_idx);
- vq->virtio_net_hdr_mz = rte_memzone_reserve_aligned(vq_name,
- vq_size * hw->vtnet_hdr_size,
- socket_id, 0, RTE_CACHE_LINE_SIZE);
- if (vq->virtio_net_hdr_mz == NULL) {
- if (rte_errno == EEXIST)
- vq->virtio_net_hdr_mz =
- rte_memzone_lookup(vq_name);
- if (vq->virtio_net_hdr_mz == NULL) {
- rte_free(vq);
- return -ENOMEM;
- }
- }
- vq->virtio_net_hdr_mem =
- vq->virtio_net_hdr_mz->phys_addr;
- memset(vq->virtio_net_hdr_mz->addr, 0,
- vq_size * hw->vtnet_hdr_size);
+ hdr_size = vq_size * hw->vtnet_hdr_size;
} else if (queue_type == VTNET_CQ) {
/* Allocate a page for control vq command, data and status */
snprintf(vq_name, sizeof(vq_name), "port%d_cvq_hdrzone",
dev->data->port_id);
- vq->virtio_net_hdr_mz = rte_memzone_reserve_aligned(vq_name,
- PAGE_SIZE, socket_id, 0, RTE_CACHE_LINE_SIZE);
- if (vq->virtio_net_hdr_mz == NULL) {
+ hdr_size = PAGE_SIZE;
+ }
+
+ if (hdr_size) { /* queue_type is VTNET_TQ or VTNET_CQ */
+ mz = rte_memzone_reserve_aligned(vq_name, hdr_size, socket_id,
+ 0, RTE_CACHE_LINE_SIZE);
+ if (!mz) {
if (rte_errno == EEXIST)
- vq->virtio_net_hdr_mz =
- rte_memzone_lookup(vq_name);
- if (vq->virtio_net_hdr_mz == NULL) {
+ mz = rte_memzone_lookup(vq_name);
+ if (!mz) {
rte_free(vq);
return -ENOMEM;
}
}
- vq->virtio_net_hdr_mem =
- vq->virtio_net_hdr_mz->phys_addr;
- memset(vq->virtio_net_hdr_mz->addr, 0, PAGE_SIZE);
+ vq->virtio_net_hdr_mz = mz;
+ vq->virtio_net_hdr_vaddr = mz->addr;
+ memset(vq->virtio_net_hdr_vaddr, 0, hdr_size);
+
+ if (dev->dev_type == RTE_ETH_DEV_PCI)
+ vq->virtio_net_hdr_mem = mz->phys_addr;
+#ifdef RTE_VIRTIO_VDEV
+ else
+ vq->virtio_net_hdr_mem = (phys_addr_t)mz->addr;
+#endif
}

hw->vtpci_ops->setup_queue(hw, vq);

+ if (dev->dev_type == RTE_ETH_DEV_PCI)
+ vq->offset = offsetof(struct rte_mbuf, buf_physaddr);
+#ifdef RTE_VIRTIO_VDEV
+ else
+ vq->offset = offsetof(struct rte_mbuf, buf_addr);
+#endif
+
*pvq = vq;
return 0;
}
@@ -479,8 +495,10 @@ virtio_dev_close(struct rte_eth_dev *dev)
PMD_INIT_LOG(DEBUG, "virtio_dev_close");

/* reset the NIC */
- if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
- vtpci_irq_config(hw, VIRTIO_MSI_NO_VECTOR);
+ if (dev->dev_type == RTE_ETH_DEV_PCI) {
+ if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
+ vtpci_irq_config(hw, VIRTIO_MSI_NO_VECTOR);
+ }
vtpci_reset(hw);
hw->started = 0;
virtio_dev_free_mbufs(dev);
@@ -983,8 +1001,9 @@ virtio_interrupt_handler(__rte_unused struct rte_intr_handle *handle,
isr = vtpci_isr(hw);
PMD_DRV_LOG(INFO, "interrupt status = %#x", isr);

- if (rte_intr_enable(&dev->pci_dev->intr_handle) < 0)
- PMD_DRV_LOG(ERR, "interrupt enable failed");
+ if (dev->dev_type == RTE_ETH_DEV_PCI)
+ if (rte_intr_enable(&dev->pci_dev->intr_handle) < 0)
+ PMD_DRV_LOG(ERR, "interrupt enable failed");

if (isr & VIRTIO_PCI_ISR_CONFIG) {
if (virtio_dev_link_update(dev, 0) == 0)
@@ -1037,8 +1056,9 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)

pci_dev = eth_dev->pci_dev;

- if (vtpci_init(pci_dev, hw) < 0)
- return -1;
+ if (eth_dev->dev_type == RTE_ETH_DEV_PCI)
+ if (vtpci_init(pci_dev, hw) < 0)
+ return -1;

/* Reset the device although not necessary at startup */
vtpci_reset(hw);
@@ -1052,10 +1072,12 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
return -1;

/* If host does not support status then disable LSC */
- if (!vtpci_with_feature(hw, VIRTIO_NET_F_STATUS))
- pci_dev->driver->drv_flags &= ~RTE_PCI_DRV_INTR_LSC;
+ if (eth_dev->dev_type == RTE_ETH_DEV_PCI) {
+ if (!vtpci_with_feature(hw, VIRTIO_NET_F_STATUS))
+ pci_dev->driver->drv_flags &= ~RTE_PCI_DRV_INTR_LSC;

- rte_eth_copy_pci_info(eth_dev, pci_dev);
+ rte_eth_copy_pci_info(eth_dev, pci_dev);
+ }

rx_func_get(eth_dev);

@@ -1132,15 +1154,17 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)

PMD_INIT_LOG(DEBUG, "hw->max_rx_queues=%d hw->max_tx_queues=%d",
hw->max_rx_queues, hw->max_tx_queues);
- PMD_INIT_LOG(DEBUG, "port %d vendorID=0x%x deviceID=0x%x",
- eth_dev->data->port_id, pci_dev->id.vendor_id,
- pci_dev->id.device_id);
-
- /* Setup interrupt callback */
- if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
- rte_intr_callback_register(&pci_dev->intr_handle,
- virtio_interrupt_handler, eth_dev);
-
+ if (eth_dev->dev_type == RTE_ETH_DEV_PCI) {
+ PMD_INIT_LOG(DEBUG, "port %d vendorID=0x%x deviceID=0x%x",
+ eth_dev->data->port_id, pci_dev->id.vendor_id,
+ pci_dev->id.device_id);
+
+ /* Setup interrupt callback */
+ if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
+ rte_intr_callback_register(&pci_dev->intr_handle,
+ virtio_interrupt_handler,
+ eth_dev);
+ }
virtio_dev_cq_start(eth_dev);

return 0;
@@ -1173,10 +1197,11 @@ eth_virtio_dev_uninit(struct rte_eth_dev *eth_dev)
eth_dev->data->mac_addrs = NULL;

/* reset interrupt callback */
- if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
- rte_intr_callback_unregister(&pci_dev->intr_handle,
- virtio_interrupt_handler,
- eth_dev);
+ if (eth_dev->dev_type == RTE_ETH_DEV_PCI)
+ if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
+ rte_intr_callback_unregister(&pci_dev->intr_handle,
+ virtio_interrupt_handler,
+ eth_dev);
rte_eal_pci_unmap_device(pci_dev);

PMD_INIT_LOG(DEBUG, "dev_uninit completed");
@@ -1241,11 +1266,13 @@ virtio_dev_configure(struct rte_eth_dev *dev)
return -ENOTSUP;
}

- if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
- if (vtpci_irq_config(hw, 0) == VIRTIO_MSI_NO_VECTOR) {
- PMD_DRV_LOG(ERR, "failed to set config vector");
- return -EBUSY;
- }
+ if (dev->dev_type == RTE_ETH_DEV_PCI) {
+ if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
+ if (vtpci_irq_config(hw, 0) == VIRTIO_MSI_NO_VECTOR) {
+ PMD_DRV_LOG(ERR, "failed to set config vector");
+ return -EBUSY;
+ }
+ }

return 0;
}
@@ -1439,3 +1466,167 @@ static struct rte_driver rte_virtio_driver = {
};

PMD_REGISTER_DRIVER(rte_virtio_driver);
+
+#ifdef RTE_VIRTIO_VDEV
+
+static const char *valid_args[] = {
+#define ETH_CVIO_ARG_RX_NUM "rx"
+ ETH_CVIO_ARG_RX_NUM,
+#define ETH_CVIO_ARG_TX_NUM "tx"
+ ETH_CVIO_ARG_TX_NUM,
+#define ETH_CVIO_ARG_CQ_NUM "cq"
+ ETH_CVIO_ARG_CQ_NUM,
+#define ETH_CVIO_ARG_MAC "mac"
+ ETH_CVIO_ARG_MAC,
+#define ETH_CVIO_ARG_PATH "path"
+ ETH_CVIO_ARG_PATH,
+#define ETH_CVIO_ARG_QUEUE_SIZE "queue_num"
+ ETH_CVIO_ARG_QUEUE_SIZE,
+#define ETH_CVIO_ARG_IFNAME "ifname"
+ ETH_CVIO_ARG_IFNAME,
+ NULL
+};
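+
+/* Illustrative devargs using these keys (path is mandatory, the rest
+ * fall back to defaults):
+ *   --vdev=eth_cvio0,path=/var/run/usvhost,queue_num=256
+ */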
+
+static int
+get_string_arg(const char *key __rte_unused,
+ const char *value, void *extra_args)
+{
+ if (!value || !extra_args)
+ return -EINVAL;
+
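+	/* the caller owns the strdup'd copy and must free() it */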
+ *(char **)extra_args = strdup(value);
+
+ return 0;
+}
+
+static int
+get_integer_arg(const char *key __rte_unused,
+ const char *value, void *extra_args)
+{
+ if (!value || !extra_args)
+ return -EINVAL;
+
+ *(uint64_t *)extra_args = strtoull(value, NULL, 0);
+
+ return 0;
+}
+
+static struct rte_eth_dev *
+cvio_eth_dev_alloc(const char *name)
+{
+ struct rte_eth_dev *eth_dev;
+ struct rte_eth_dev_data *data;
+ struct virtio_hw *hw;
+
+ eth_dev = rte_eth_dev_allocate(name, RTE_ETH_DEV_VIRTUAL);
+ if (!eth_dev)
+ rte_panic("cannot alloc rte_eth_dev\n");
+
+ data = eth_dev->data;
+
+ hw = rte_zmalloc(NULL, sizeof(*hw), 0);
+ if (!hw)
+ rte_panic("malloc virtio_hw failed\n");
+
+ data->dev_private = hw;
+ data->numa_node = SOCKET_ID_ANY;
+ eth_dev->pci_dev = NULL;
+ /* will be used in virtio_dev_info_get() */
+ eth_dev->driver = &rte_virtio_pmd;
+ /* TODO: eth_dev->link_intr_cbs */
+ return eth_dev;
+}
+
+#define CVIO_DEF_CQ_EN 0
+#define CVIO_DEF_Q_NUM 1
+#define CVIO_DEF_Q_SZ 256
+/* Dev initialization routine. Invoked once for each virtio vdev at
+ * EAL init time, see rte_eal_dev_init().
+ * Returns 0 on success.
+ */
+static int
+rte_cvio_pmd_devinit(const char *name, const char *params)
+{
+ struct rte_kvargs *kvlist = NULL;
+ struct rte_eth_dev *eth_dev = NULL;
+ uint64_t nb_rx = CVIO_DEF_Q_NUM;
+ uint64_t nb_tx = CVIO_DEF_Q_NUM;
+ uint64_t nb_cq = CVIO_DEF_CQ_EN;
+ uint64_t queue_num = CVIO_DEF_Q_SZ;
+ char *sock_path = NULL;
+ char *mac_addr = NULL;
+ char *ifname = NULL;
+
+ if (!params || params[0] == '\0')
+ rte_panic("arg %s is mandatory for eth_cvio\n",
+			  ETH_CVIO_ARG_PATH);
+
+ kvlist = rte_kvargs_parse(params, valid_args);
+ if (!kvlist)
+		rte_panic("error when parsing devargs\n");
+
+ if (rte_kvargs_count(kvlist, ETH_CVIO_ARG_PATH) == 1)
+ rte_kvargs_process(kvlist, ETH_CVIO_ARG_PATH,
+ &get_string_arg, &sock_path);
+ else
+ rte_panic("arg %s is mandatory for eth_cvio\n",
+			  ETH_CVIO_ARG_PATH);
+
+ if (rte_kvargs_count(kvlist, ETH_CVIO_ARG_MAC) == 1)
+ rte_kvargs_process(kvlist, ETH_CVIO_ARG_MAC,
+ &get_string_arg, &mac_addr);
+
+ if (rte_kvargs_count(kvlist, ETH_CVIO_ARG_IFNAME) == 1)
+ rte_kvargs_process(kvlist, ETH_CVIO_ARG_IFNAME,
+ &get_string_arg, &ifname);
+
+ if (rte_kvargs_count(kvlist, ETH_CVIO_ARG_QUEUE_SIZE) == 1)
+ rte_kvargs_process(kvlist, ETH_CVIO_ARG_QUEUE_SIZE,
+ &get_integer_arg, &queue_num);
+
+ if (rte_kvargs_count(kvlist, ETH_CVIO_ARG_RX_NUM) == 1)
+ rte_kvargs_process(kvlist, ETH_CVIO_ARG_RX_NUM,
+ &get_integer_arg, &nb_rx);
+
+ if (rte_kvargs_count(kvlist, ETH_CVIO_ARG_TX_NUM) == 1)
+ rte_kvargs_process(kvlist, ETH_CVIO_ARG_TX_NUM,
+ &get_integer_arg, &nb_tx);
+
+ if (rte_kvargs_count(kvlist, ETH_CVIO_ARG_CQ_NUM) == 1)
+ rte_kvargs_process(kvlist, ETH_CVIO_ARG_CQ_NUM,
+ &get_integer_arg, &nb_cq);
+
+ eth_dev = cvio_eth_dev_alloc(name);
+
+ virtio_vdev_init(eth_dev->data, sock_path, nb_rx, nb_tx, nb_cq,
+ queue_num, mac_addr, ifname);
+ if (sock_path)
+ free(sock_path);
+ if (mac_addr)
+ free(mac_addr);
+ if (ifname)
+ free(ifname);
+
+	/* for a PCI device this is invoked from rte_eal_pci_probe();
+	 * for the vdev we call it directly here
+	 */
+	eth_virtio_dev_init(eth_dev);
+
+	rte_kvargs_free(kvlist);
+
+	return 0;
+}
+
+static int
+rte_cvio_pmd_devuninit(const char *name)
+{
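+	/* TODO: detach of a cvio vdev is not supported in this PoC */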
+	rte_panic("uninit of %s is not supported\n", name);
+ return 0;
+}
+
+static struct rte_driver rte_cvio_driver = {
+ .name = "eth_cvio",
+ .type = PMD_VDEV,
+ .init = rte_cvio_pmd_devinit,
+ .uninit = rte_cvio_pmd_devuninit,
+};
+
+PMD_REGISTER_DRIVER(rte_cvio_driver);
+
+#endif
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 41a1366..cebd75a 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -191,8 +191,7 @@ virtqueue_enqueue_recv_refill(struct virtqueue *vq, struct rte_mbuf *cookie)

start_dp = vq->vq_ring.desc;
start_dp[idx].addr =
- (uint64_t)(cookie->buf_physaddr + RTE_PKTMBUF_HEADROOM
- - hw->vtnet_hdr_size);
+ RTE_MBUF_DATA_DMA_ADDR(cookie, vq->offset) - hw->vtnet_hdr_size;
start_dp[idx].len =
cookie->buf_len - RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size;
start_dp[idx].flags = VRING_DESC_F_WRITE;
@@ -237,7 +236,8 @@ virtqueue_enqueue_xmit(struct virtqueue *txvq, struct rte_mbuf *cookie)

for (; ((seg_num > 0) && (cookie != NULL)); seg_num--) {
idx = start_dp[idx].next;
- start_dp[idx].addr = RTE_MBUF_DATA_DMA_ADDR(cookie);
+ start_dp[idx].addr =
+ RTE_MBUF_DATA_DMA_ADDR(cookie, txvq->offset);
start_dp[idx].len = cookie->data_len;
start_dp[idx].flags = VRING_DESC_F_NEXT;
cookie = cookie->next;
diff --git a/drivers/net/virtio/virtio_rxtx_simple.c b/drivers/net/virtio/virtio_rxtx_simple.c
index 3a1de9d..92a6388 100644
--- a/drivers/net/virtio/virtio_rxtx_simple.c
+++ b/drivers/net/virtio/virtio_rxtx_simple.c
@@ -80,8 +80,8 @@ virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
vq->sw_ring[desc_idx] = cookie;

start_dp = vq->vq_ring.desc;
- start_dp[desc_idx].addr = (uint64_t)((uintptr_t)cookie->buf_physaddr +
- RTE_PKTMBUF_HEADROOM - vq->hw->vtnet_hdr_size);
+ start_dp[desc_idx].addr = RTE_MBUF_DATA_DMA_ADDR(cookie, vq->offset)
+ - vq->hw->vtnet_hdr_size;
start_dp[desc_idx].len = cookie->buf_len -
RTE_PKTMBUF_HEADROOM + vq->hw->vtnet_hdr_size;

@@ -119,8 +119,8 @@ virtio_rxq_rearm_vec(struct virtqueue *rxvq)
*(uint64_t *)p = rxvq->mbuf_initializer;

start_dp[i].addr =
- (uint64_t)((uintptr_t)sw_ring[i]->buf_physaddr +
- RTE_PKTMBUF_HEADROOM - rxvq->hw->vtnet_hdr_size);
+ RTE_MBUF_DATA_DMA_ADDR(sw_ring[i], rxvq->offset) -
+ rxvq->hw->vtnet_hdr_size;
start_dp[i].len = sw_ring[i]->buf_len -
RTE_PKTMBUF_HEADROOM + rxvq->hw->vtnet_hdr_size;
}
@@ -366,7 +366,7 @@ virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
txvq->vq_descx[desc_idx + i].cookie = tx_pkts[i];
for (i = 0; i < nb_tail; i++) {
start_dp[desc_idx].addr =
- RTE_MBUF_DATA_DMA_ADDR(*tx_pkts);
+ RTE_MBUF_DATA_DMA_ADDR(*tx_pkts, txvq->offset);
start_dp[desc_idx].len = (*tx_pkts)->pkt_len;
tx_pkts++;
desc_idx++;
@@ -377,7 +377,8 @@ virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
for (i = 0; i < nb_commit; i++)
txvq->vq_descx[desc_idx + i].cookie = tx_pkts[i];
for (i = 0; i < nb_commit; i++) {
- start_dp[desc_idx].addr = RTE_MBUF_DATA_DMA_ADDR(*tx_pkts);
+ start_dp[desc_idx].addr = RTE_MBUF_DATA_DMA_ADDR(*tx_pkts,
+ txvq->offset);
start_dp[desc_idx].len = (*tx_pkts)->pkt_len;
tx_pkts++;
desc_idx++;
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 99d4fa9..057c4ed 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -66,8 +66,14 @@ struct rte_mbuf;

#define VIRTQUEUE_MAX_NAME_SZ 32

-#define RTE_MBUF_DATA_DMA_ADDR(mb) \
- (uint64_t) ((mb)->buf_physaddr + (mb)->data_off)
+#ifdef RTE_VIRTIO_VDEV
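+/* Load the 64-bit field located 'offset' bytes into the mbuf
+ * (buf_physaddr or buf_addr, as recorded in vq->offset) and add
+ * data_off; assumes a 64-bit build where both fields are 8 bytes wide.
+ */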
+#define RTE_MBUF_DATA_DMA_ADDR(mb, offset) \
+	((uint64_t)((uintptr_t)(*(void **)((uintptr_t)(mb) + (offset))) \
+	+ (mb)->data_off))
+#else
+#define RTE_MBUF_DATA_DMA_ADDR(mb, offset) \
+ ((uint64_t)((mb)->buf_physaddr + (mb)->data_off))
+#endif /* RTE_VIRTIO_VDEV */

#define VTNET_SQ_RQ_QUEUE_IDX 0
#define VTNET_SQ_TQ_QUEUE_IDX 1
@@ -167,7 +173,8 @@ struct virtqueue {

void *vq_ring_virt_mem; /**< linear address of vring*/
unsigned int vq_ring_size;
- phys_addr_t vq_ring_mem; /**< physical address of vring */
+	phys_addr_t vq_ring_mem; /**< phys addr of vring for PCI dev,
+				  *   virt addr of vring for vdev */

struct vring vq_ring; /**< vring keeping desc, used and avail */
uint16_t vq_free_cnt; /**< num of desc available */
@@ -186,8 +193,10 @@ struct virtqueue {
*/
uint16_t vq_used_cons_idx;
uint16_t vq_avail_idx;
+	uint16_t offset; /**< byte offset of the address field in mbuf */
uint64_t mbuf_initializer; /**< value to init mbufs. */
phys_addr_t virtio_net_hdr_mem; /**< hdr for each xmit packet */
+	void *virtio_net_hdr_vaddr; /**< linear address of net hdr memzone */

struct rte_mbuf **sw_ring; /**< RX software ring. */
/* dummy mbuf, for wraparound when processing RX ring. */
--
2.1.4
Jianfeng Tan
2016-02-05 11:20:28 UTC
Permalink
Signed-off-by: Huawei Xie <***@intel.com>
Signed-off-by: Jianfeng Tan <***@intel.com>
---
doc/guides/rel_notes/release_2_3.rst | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/doc/guides/rel_notes/release_2_3.rst b/doc/guides/rel_notes/release_2_3.rst
index 7945694..1e7d51d 100644
--- a/doc/guides/rel_notes/release_2_3.rst
+++ b/doc/guides/rel_notes/release_2_3.rst
@@ -39,6 +39,10 @@ This section should contain new features added in this release. Sample format:

Enabled virtio 1.0 support for virtio pmd driver.

+* **Virtio support for containers.**
+
+  Added a new virtual device, named eth_cvio, to support virtio for containers.
+

Resolved Issues
---------------
--
2.1.4