Discussion:
[dpdk-dev] [RFC 0/3] tqs: add thread quiescent state library
Honnappa Nagarahalli
2018-11-22 03:30:52 UTC
Lock-less data structures provide scalability and determinism.
They enable use cases where locking may not be allowed
(for example, real-time applications).

In the following paragraphs, the term 'memory' refers to memory allocated
by typical APIs like malloc, or anything that is representative of
memory, for example an index into a free-element array.

Since these data structures are lock-less, writers and readers access
them concurrently. Hence, while removing an element from a data
structure, the writer cannot return the memory to the allocator
without knowing that no reader is still referencing that
element/memory. It is therefore required to separate the operation of
removing an element into 2 steps:

Delete: in this step, the writer removes the element from the
data structure but does not return the associated memory to the allocator.
This ensures that new readers will not get a reference to the removed
element. Removing the reference is an atomic operation.

Free: in this step, the writer returns the memory to the
memory allocator, only after knowing that all the readers have stopped
referencing the removed element.

This library helps the writer determine when it is safe to free the
memory.

This library makes use of Thread Quiescent State (TQS). TQS can be
defined as 'any point in the thread execution where the thread does
not hold a reference to shared memory'. It is up to the application to
determine its quiescent state. Consider the following diagram:

Time -------------------------------------------------->

| |
RT1 $++++****D1****+++***D2*|**+++|+++**D3*****++++$
| |
RT2 $++++****D1****++|+**D2|***++++++**D3*****++++$
| |
RT3 $++++****D1****+++***|D2***|++++++**D3*****++++$
| |
|<--->|
Del | Free
|
Cannot free memory
during this period

RTx - Reader thread
$ - Start and end of while(1) loop
***Dx*** - Reader thread is accessing the shared data structure Dx.
i.e. critical section.
+++ - Reader thread is not accessing any shared data structure.
i.e. non critical section or quiescent state.
Del - Point in time when the reference to the entry is removed using
atomic operation.
Free - Point in time when the writer can free the entry.

As shown, thread RT1 accesses data structures D1, D2 and D3. When it is
accessing D2, if the writer has to remove an element from D2, the
writer cannot return the memory associated with that element to the
allocator. The writer can return the memory to the allocator only after
the reader stops referencing D2. In other words, reader thread RT1
has to enter a quiescent state.

Similarly, since thread RT3 is also accessing D2, the writer has to wait
until RT3 enters a quiescent state as well.

However, the writer does not need to wait for RT2 to enter quiescent state.
Thread RT2 was not accessing D2 when the delete operation happened.
So, RT2 will not get a reference to the deleted entry.

Note that the critical sections for D2 and D3 are quiescent states
for D1. That is, for a given data structure Dx, any point in the thread
execution that does not reference Dx is a quiescent state.

For DPDK applications, the start and end of the while(1) loop (where no
shared data structures are accessed) act as perfect quiescent states. This
combines all the shared data structure accesses into a single critical
section and keeps the overhead introduced by this library to a minimum.

However, the time taken to identify the end of a critical section is
proportional to the length of the critical section and the number of
reader threads. So, if the application desires, it should be possible
to identify the end of the critical section for each data structure
individually.

To provide the required flexibility, this library has a concept of TQS
variable. The application can create one or more TQS variables to help it
track the end of one or more critical sections.

The application can create a TQS variable using the API rte_tqs_alloc.
It takes a mask of lcore IDs that will report their quiescent states
using this variable. This mask can be empty to start with.

The rte_tqs_register_lcore API registers a reader thread to report its
quiescent state. It can be called from any control plane thread or from
the reader thread itself. The application can create a TQS variable with
no reader threads and add the threads dynamically using this API.

The application can trigger the reader threads to report their quiescent
state status by calling the API rte_tqs_start. It is possible for multiple
writer threads to query the quiescent state status simultaneously. Hence,
rte_tqs_start returns a token to each caller.

The application has to call the rte_tqs_check API with the token to get
the current status. An option to block until all the threads enter the
quiescent state is provided. If this API indicates that all the threads
have entered the quiescent state, the application can free the deleted entry.

The separation of triggering the reporting from querying the status provides
the writer threads flexibility to do useful work instead of waiting for the
reader threads to enter the quiescent state.

The rte_tqs_unregister_lcore API removes a reader thread from reporting
its quiescent state on a TQS variable. The rte_tqs_check API will then
no longer wait for this reader thread to report its quiescent state.

Finally, a TQS variable can be deleted by calling the rte_tqs_free API.
The application must make sure that the reader threads are no longer
referencing the TQS variable before deleting it.

The reader threads should call the rte_tqs_update API to indicate that
they have entered a quiescent state. This API checks if a writer has
triggered a quiescent state query and updates the state accordingly.

Next Steps:
1) Add more test cases
2) Convert to patch
3) Incorporate feedback from community
4) Add documentation

Dharmik Thakkar (1):
test/tqs: Add API and functional tests

Honnappa Nagarahalli (2):
log: add TQS log type
tqs: add thread quiescent state library

config/common_base | 6 +
lib/Makefile | 2 +
lib/librte_eal/common/include/rte_log.h | 1 +
lib/librte_tqs/Makefile | 23 +
lib/librte_tqs/meson.build | 5 +
lib/librte_tqs/rte_tqs.c | 249 +++++++++++
lib/librte_tqs/rte_tqs.h | 352 +++++++++++++++
lib/librte_tqs/rte_tqs_version.map | 16 +
lib/meson.build | 2 +-
mk/rte.app.mk | 1 +
test/test/Makefile | 2 +
test/test/autotest_data.py | 6 +
test/test/meson.build | 5 +-
test/test/test_tqs.c | 540 ++++++++++++++++++++++++
14 files changed, 1208 insertions(+), 2 deletions(-)
create mode 100644 lib/librte_tqs/Makefile
create mode 100644 lib/librte_tqs/meson.build
create mode 100644 lib/librte_tqs/rte_tqs.c
create mode 100644 lib/librte_tqs/rte_tqs.h
create mode 100644 lib/librte_tqs/rte_tqs_version.map
create mode 100644 test/test/test_tqs.c
--
2.17.1
Honnappa Nagarahalli
2018-11-22 03:30:53 UTC
Signed-off-by: Honnappa Nagarahalli <***@arm.com>
Reviewed-by: Steve Capper <***@arm.com>
Reviewed-by: Gavin Hu <***@arm.com>
---
lib/librte_eal/common/include/rte_log.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/lib/librte_eal/common/include/rte_log.h b/lib/librte_eal/common/include/rte_log.h
index 2f789cb90..b4e91a4a5 100644
--- a/lib/librte_eal/common/include/rte_log.h
+++ b/lib/librte_eal/common/include/rte_log.h
@@ -61,6 +61,7 @@ extern struct rte_logs rte_logs;
#define RTE_LOGTYPE_EFD 18 /**< Log related to EFD. */
#define RTE_LOGTYPE_EVENTDEV 19 /**< Log related to eventdev. */
#define RTE_LOGTYPE_GSO 20 /**< Log related to GSO. */
+#define RTE_LOGTYPE_TQS 21 /**< Log related to Thread Quiescent State. */

/* these log types can be used in an application */
#define RTE_LOGTYPE_USER1 24 /**< User-defined log type 1. */
--
2.17.1
Stephen Hemminger
2018-11-27 22:24:15 UTC
On Wed, 21 Nov 2018 21:30:53 -0600
Post by Honnappa Nagarahalli
---
lib/librte_eal/common/include/rte_log.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/lib/librte_eal/common/include/rte_log.h b/lib/librte_eal/common/include/rte_log.h
index 2f789cb90..b4e91a4a5 100644
--- a/lib/librte_eal/common/include/rte_log.h
+++ b/lib/librte_eal/common/include/rte_log.h
@@ -61,6 +61,7 @@ extern struct rte_logs rte_logs;
#define RTE_LOGTYPE_EFD 18 /**< Log related to EFD. */
#define RTE_LOGTYPE_EVENTDEV 19 /**< Log related to eventdev. */
#define RTE_LOGTYPE_GSO 20 /**< Log related to GSO. */
+#define RTE_LOGTYPE_TQS 21 /**< Log related to Thread Quiescent State. */
/* these log types can be used in an application */
#define RTE_LOGTYPE_USER1 24 /**< User-defined log type 1. */
Sorry, I don't think this is the right way now.

All new logging should be using dynamic log types.
We should work on getting rid of others (EFD, EVENTDEV, GSO).
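For reference, the dynamic-logging pattern Stephen refers to looks
roughly like this — a sketch following what other DPDK components were
adopting at the time; the log name "lib.tqs" and the TQS_LOG macro are
illustrative, not part of any existing API:

```c
/* In rte_tqs.c -- instead of claiming a fixed RTE_LOGTYPE_TQS slot */
static int tqs_logtype;

#define TQS_LOG(level, fmt, args...) \
	rte_log(RTE_LOG_ ## level, tqs_logtype, "TQS: " fmt, ## args)

RTE_INIT(tqs_log_init)
{
	/* register a dynamic log type at startup */
	tqs_logtype = rte_log_register("lib.tqs");
	if (tqs_logtype >= 0)
		rte_log_set_level(tqs_logtype, RTE_LOG_INFO);
}
```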
Honnappa Nagarahalli
2018-11-28 05:58:58 UTC
Post by Stephen Hemminger
On Wed, 21 Nov 2018 21:30:53 -0600
Post by Honnappa Nagarahalli
---
lib/librte_eal/common/include/rte_log.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/lib/librte_eal/common/include/rte_log.h
b/lib/librte_eal/common/include/rte_log.h
Post by Honnappa Nagarahalli
index 2f789cb90..b4e91a4a5 100644
--- a/lib/librte_eal/common/include/rte_log.h
+++ b/lib/librte_eal/common/include/rte_log.h
@@ -61,6 +61,7 @@ extern struct rte_logs rte_logs;
#define RTE_LOGTYPE_EFD 18 /**< Log related to EFD. */
#define RTE_LOGTYPE_EVENTDEV 19 /**< Log related to eventdev. */
#define RTE_LOGTYPE_GSO 20 /**< Log related to GSO. */
+#define RTE_LOGTYPE_TQS 21 /**< Log related to Thread Quiescent State.
*/
Post by Honnappa Nagarahalli
/* these log types can be used in an application */
#define RTE_LOGTYPE_USER1 24 /**< User-defined log type 1. */
Sorry, I don't think this is the right way now.
Ok. I see some examples for the libraries already. I will change it in the next version.
Post by Stephen Hemminger
All new logging should be using dynamic log types.
We should work on getting rid of others (EFD, EVENTDEV, GSO).
Honnappa Nagarahalli
2018-11-22 03:30:54 UTC
Add Thread Quiescent State (TQS) library. This library helps identify
the quiescent state of the reader threads so that the writers
can free the memory associated with the lock less data structures.

Signed-off-by: Honnappa Nagarahalli <***@arm.com>
Reviewed-by: Steve Capper <***@arm.com>
Reviewed-by: Gavin Hu <***@arm.com>
---
config/common_base | 6 +
lib/Makefile | 2 +
lib/librte_tqs/Makefile | 23 ++
lib/librte_tqs/meson.build | 5 +
lib/librte_tqs/rte_tqs.c | 249 ++++++++++++++++++++
lib/librte_tqs/rte_tqs.h | 352 +++++++++++++++++++++++++++++
lib/librte_tqs/rte_tqs_version.map | 16 ++
lib/meson.build | 2 +-
mk/rte.app.mk | 1 +
9 files changed, 655 insertions(+), 1 deletion(-)
create mode 100644 lib/librte_tqs/Makefile
create mode 100644 lib/librte_tqs/meson.build
create mode 100644 lib/librte_tqs/rte_tqs.c
create mode 100644 lib/librte_tqs/rte_tqs.h
create mode 100644 lib/librte_tqs/rte_tqs_version.map

diff --git a/config/common_base b/config/common_base
index d12ae98bc..af40a9f81 100644
--- a/config/common_base
+++ b/config/common_base
@@ -792,6 +792,12 @@ CONFIG_RTE_LIBRTE_LATENCY_STATS=y
#
CONFIG_RTE_LIBRTE_TELEMETRY=n

+#
+# Compile librte_tqs
+#
+CONFIG_RTE_LIBRTE_TQS=y
+CONFIG_RTE_LIBRTE_TQS_DEBUG=n
+
#
# Compile librte_lpm
#
diff --git a/lib/Makefile b/lib/Makefile
index b7370ef97..7095eac88 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -108,6 +108,8 @@ DIRS-$(CONFIG_RTE_LIBRTE_BPF) += librte_bpf
DEPDIRS-librte_bpf := librte_eal librte_mempool librte_mbuf librte_ethdev
DIRS-$(CONFIG_RTE_LIBRTE_TELEMETRY) += librte_telemetry
DEPDIRS-librte_telemetry := librte_eal librte_metrics librte_ethdev
+DIRS-$(CONFIG_RTE_LIBRTE_TQS) += librte_tqs
+DEPDIRS-librte_tqs := librte_eal

ifeq ($(CONFIG_RTE_EXEC_ENV_LINUXAPP),y)
DIRS-$(CONFIG_RTE_LIBRTE_KNI) += librte_kni
diff --git a/lib/librte_tqs/Makefile b/lib/librte_tqs/Makefile
new file mode 100644
index 000000000..059de53e2
--- /dev/null
+++ b/lib/librte_tqs/Makefile
@@ -0,0 +1,23 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2018 Arm Limited
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+# library name
+LIB = librte_tqs.a
+
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3
+LDLIBS += -lrte_eal
+
+EXPORT_MAP := rte_tqs_version.map
+
+LIBABIVER := 1
+
+# all source are stored in SRCS-y
+SRCS-$(CONFIG_RTE_LIBRTE_TQS) := rte_tqs.c
+
+# install includes
+SYMLINK-$(CONFIG_RTE_LIBRTE_TQS)-include := rte_tqs.h
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_tqs/meson.build b/lib/librte_tqs/meson.build
new file mode 100644
index 000000000..dd696ab07
--- /dev/null
+++ b/lib/librte_tqs/meson.build
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2018 Arm Limited
+
+sources = files('rte_tqs.c')
+headers = files('rte_tqs.h')
diff --git a/lib/librte_tqs/rte_tqs.c b/lib/librte_tqs/rte_tqs.c
new file mode 100644
index 000000000..cbc36864e
--- /dev/null
+++ b/lib/librte_tqs/rte_tqs.c
@@ -0,0 +1,249 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Copyright (c) 2018 Arm Limited
+ */
+
+#include <stdio.h>
+#include <string.h>
+#include <stdint.h>
+#include <errno.h>
+
+#include <rte_common.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+#include <rte_malloc.h>
+#include <rte_eal.h>
+#include <rte_eal_memconfig.h>
+#include <rte_atomic.h>
+#include <rte_per_lcore.h>
+#include <rte_lcore.h>
+#include <rte_errno.h>
+
+#include "rte_tqs.h"
+
+TAILQ_HEAD(rte_tqs_list, rte_tailq_entry);
+
+static struct rte_tailq_elem rte_tqs_tailq = {
+ .name = RTE_TAILQ_TQS_NAME,
+};
+EAL_REGISTER_TAILQ(rte_tqs_tailq)
+
+/* Allocate a new TQS variable with the name *name* in memory. */
+struct rte_tqs * __rte_experimental
+rte_tqs_alloc(const char *name, int socket_id, uint64_t lcore_mask)
+{
+ char tqs_name[RTE_TQS_NAMESIZE];
+ struct rte_tailq_entry *te, *tmp_te;
+ struct rte_tqs_list *tqs_list;
+ struct rte_tqs *v, *tmp_v;
+ int ret;
+
+ if (name == NULL) {
+ RTE_LOG(ERR, TQS, "Invalid input parameters\n");
+ rte_errno = -EINVAL;
+ return NULL;
+ }
+
+ te = rte_zmalloc("TQS_TAILQ_ENTRY", sizeof(*te), 0);
+ if (te == NULL) {
+ RTE_LOG(ERR, TQS, "Cannot reserve memory for tailq\n");
+ rte_errno = -ENOMEM;
+ return NULL;
+ }
+
+ snprintf(tqs_name, sizeof(tqs_name), "%s", name);
+ v = rte_zmalloc_socket(tqs_name, sizeof(struct rte_tqs),
+ RTE_CACHE_LINE_SIZE, socket_id);
+ if (v == NULL) {
+ RTE_LOG(ERR, TQS, "Cannot reserve memory for TQS variable\n");
+ rte_errno = -ENOMEM;
+ goto alloc_error;
+ }
+
+ ret = snprintf(v->name, sizeof(v->name), "%s", name);
+ if (ret < 0 || ret >= (int)sizeof(v->name)) {
+ rte_errno = -ENAMETOOLONG;
+ goto alloc_error;
+ }
+
+ te->data = (void *) v;
+ v->lcore_mask = lcore_mask;
+
+ rte_rwlock_write_lock(RTE_EAL_TAILQ_RWLOCK);
+
+ tqs_list = RTE_TAILQ_CAST(rte_tqs_tailq.head, rte_tqs_list);
+
+ /* Search if a TQS variable with the same name exists already */
+ TAILQ_FOREACH(tmp_te, tqs_list, next) {
+ tmp_v = (struct rte_tqs *) tmp_te->data;
+ if (strncmp(name, tmp_v->name, RTE_TQS_NAMESIZE) == 0)
+ break;
+ }
+
+ if (tmp_te != NULL) {
+ rte_errno = -EEXIST;
+ goto tqs_exist;
+ }
+
+ TAILQ_INSERT_TAIL(tqs_list, te, next);
+
+ rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+ return v;
+
+tqs_exist:
+ rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+alloc_error:
+ rte_free(te);
+ rte_free(v);
+ return NULL;
+}
+
+/* De-allocate all the memory used by a TQS variable. */
+void __rte_experimental
+rte_tqs_free(struct rte_tqs *v)
+{
+ struct rte_tqs_list *tqs_list;
+ struct rte_tailq_entry *te;
+
+ /* Search for the TQS variable in tailq */
+ tqs_list = RTE_TAILQ_CAST(rte_tqs_tailq.head, rte_tqs_list);
+ rte_rwlock_write_lock(RTE_EAL_TAILQ_RWLOCK);
+
+ TAILQ_FOREACH(te, tqs_list, next) {
+ if (te->data == (void *) v)
+ break;
+ }
+
+ if (te != NULL)
+ TAILQ_REMOVE(tqs_list, te, next);
+ else
+ RTE_LOG(ERR, TQS, "TQS variable %s not found\n", v->name);
+
+ rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+ rte_free(te);
+ rte_free(v);
+}
+
+/* Add a reader thread, running on an lcore, to the list of threads
+ * reporting their quiescent state on a TQS variable.
+ */
+int __rte_experimental
+rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id)
+{
+ TQS_RETURN_IF_TRUE((v == NULL || lcore_id >= RTE_TQS_MAX_LCORE),
+ -EINVAL);
+
+ /* Worker thread has to count the quiescent states
+ * only from the current value of token.
+ */
+ v->w[lcore_id].cnt = v->token;
+
+ /* Release the store to initial TQS count so that workers
+ * can use it immediately after this function returns.
+ */
+ __atomic_fetch_or(&v->lcore_mask, (1UL << lcore_id), __ATOMIC_RELEASE);
+
+ return 0;
+}
+
+/* Remove a reader thread, running on an lcore, from the list of threads
+ * reporting their quiescent state on a TQS variable.
+ */
+int __rte_experimental
+rte_tqs_unregister_lcore(struct rte_tqs *v, unsigned int lcore_id)
+{
+ TQS_RETURN_IF_TRUE((v == NULL ||
+ lcore_id >= RTE_TQS_MAX_LCORE), -EINVAL);
+
+ /* This can be a relaxed store. Since this is an API, make sure
+ * the store is not reordered with other memory operations.
+ */
+ __atomic_fetch_and(&v->lcore_mask,
+ ~(1UL << lcore_id), __ATOMIC_RELEASE);
+
+ return 0;
+}
+
+/* Search a TQS variable, given its name. */
+struct rte_tqs * __rte_experimental
+rte_tqs_lookup(const char *name)
+{
+ struct rte_tqs_list *tqs_list;
+ struct rte_tailq_entry *te;
+ struct rte_tqs *v;
+
+ if (name == NULL) {
+ RTE_LOG(ERR, TQS, "Invalid input parameters\n");
+ rte_errno = -EINVAL;
+ return NULL;
+ }
+
+ v = NULL;
+ tqs_list = RTE_TAILQ_CAST(rte_tqs_tailq.head, rte_tqs_list);
+
+ rte_rwlock_read_lock(RTE_EAL_TAILQ_RWLOCK);
+
+ /* find out tailq entry */
+ TAILQ_FOREACH(te, tqs_list, next) {
+ v = (struct rte_tqs *) te->data;
+ if (strncmp(name, v->name, RTE_TQS_NAMESIZE) == 0)
+ break;
+ }
+
+ if (te == NULL) {
+ rte_errno = -ENOENT;
+ v = NULL;
+ }
+
+ rte_rwlock_read_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+ return v;
+}
+
+/* Dump the details of a single TQS variables to a file. */
+void __rte_experimental
+rte_tqs_dump(FILE *f, struct rte_tqs *v)
+{
+ uint64_t tmp_mask;
+ uint32_t i;
+
+ TQS_ERR_LOG_IF_TRUE(v == NULL || f == NULL);
+
+ fprintf(f, "\nTQS <%s>@%p\n", v->name, v);
+ fprintf(f, " lcore mask = 0x%lx\n", v->lcore_mask);
+ fprintf(f, " token = %u\n", v->token);
+
+ if (v->lcore_mask != 0) {
+ fprintf(f, "Quiescent State Counts for readers:\n");
+ tmp_mask = v->lcore_mask;
+ while (tmp_mask) {
+ i = __builtin_ctz(tmp_mask);
+ fprintf(f, "lcore # = %d, count = %u\n",
+ i, v->w[i].cnt);
+ tmp_mask &= ~(1UL << i);
+ }
+ }
+}
+
+/* Dump the details of all the TQS variables to a file. */
+void __rte_experimental
+rte_tqs_list_dump(FILE *f)
+{
+ const struct rte_tailq_entry *te;
+ struct rte_tqs_list *tqs_list;
+
+ TQS_ERR_LOG_IF_TRUE(f == NULL);
+
+ tqs_list = RTE_TAILQ_CAST(rte_tqs_tailq.head, rte_tqs_list);
+
+ rte_rwlock_read_lock(RTE_EAL_TAILQ_RWLOCK);
+
+ TAILQ_FOREACH(te, tqs_list, next) {
+ rte_tqs_dump(f, (struct rte_tqs *) te->data);
+ }
+
+ rte_rwlock_read_unlock(RTE_EAL_TAILQ_RWLOCK);
+}
diff --git a/lib/librte_tqs/rte_tqs.h b/lib/librte_tqs/rte_tqs.h
new file mode 100644
index 000000000..9136418d2
--- /dev/null
+++ b/lib/librte_tqs/rte_tqs.h
@@ -0,0 +1,352 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2018 Arm Limited
+ */
+
+#ifndef _RTE_TQS_H_
+#define _RTE_TQS_H_
+
+/**
+ * @file
+ * RTE Thread Quiescent State
+ *
+ * Thread Quiescent State (TQS) is any point in the thread execution
+ * where the thread does not hold a reference to shared memory, i.e.
+ * a non-critical section. A critical section for a data structure can
+ * be a quiescent state for another data structure.
+ *
+ * An application can identify the quiescent state according to its
+ * needs. It can identify 1 quiescent state for each data structure or
+ * 1 quiescent state for a group/all of data structures.
+ *
+ * This library provides the flexibility for these use cases.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stdio.h>
+#include <stdint.h>
+#include <errno.h>
+#include <rte_common.h>
+#include <rte_memory.h>
+#include <rte_lcore.h>
+
+#define RTE_TAILQ_TQS_NAME "RTE_TQS"
+
+/** The maximum length of a TQS variable name. */
+#define RTE_TQS_NAMESIZE 32
+
+/** Maximum number of lcores supported. */
+#if (RTE_MAX_LCORE > 64)
+#define RTE_TQS_MAX_LCORE 64
+#else
+#define RTE_TQS_MAX_LCORE RTE_MAX_LCORE
+#endif
+
+/* Macro for run-time checking of function parameters */
+#if defined(RTE_LIBRTE_TQS_DEBUG)
+#define TQS_RETURN_IF_TRUE(cond, retval) do { \
+ if ((cond)) \
+ return retval; \
+} while (0)
+#else
+#define TQS_RETURN_IF_TRUE(cond, retval)
+#endif
+
+/* Macro to log error message */
+#define TQS_ERR_LOG_IF_TRUE(cond) do { \
+ if ((cond)) { \
+ RTE_LOG(ERR, TQS, "Invalid parameters\n"); \
+ return; \
+ } \
+} while (0)
+
+/* Worker thread counter */
+struct rte_tqs_cnt {
+ volatile uint32_t cnt; /**< Quiescent state counter. */
+} __rte_cache_aligned;
+
+/**
+ * RTE Thread Quiescent State structure.
+ */
+struct rte_tqs {
+ char name[RTE_TQS_NAMESIZE];
+ /**< Name of the TQS variable. */
+ uint64_t lcore_mask;
+ /**< Worker lcores reporting on this TQS */
+
+ uint32_t token __rte_cache_aligned;
+ /**< Counter to allow for multiple simultaneous TQS queries */
+
+ struct rte_tqs_cnt w[RTE_TQS_MAX_LCORE] __rte_cache_aligned;
+ /**< TQS counter for each worker thread, counts up to
+ * current value of token.
+ */
+} __rte_cache_aligned;
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Allocate a new TQS variable with the name *name* in memory.
+ *
+ * The TQS variable is added in RTE_TAILQ_TQS list.
+ *
+ * @param name
+ * The name of the TQS variable.
+ * @param socket_id
+ * The *socket_id* argument is the socket identifier in case of
+ * NUMA. The value can be *SOCKET_ID_ANY* if there is no NUMA
+ * constraint for the reserved zone.
+ * @param lcore_mask
+ * Data plane reader threads in this mask will report their quiescent
+ * state on this TQS variable.
+ * @return
+ * On success, the pointer to the new allocated TQS variable. NULL on
+ * error with rte_errno set appropriately. Possible errno values include:
+ * - EINVAL - invalid input parameters
+ * - ENAMETOOLONG - TQS variable name is longer than RTE_TQS_NAMESIZE
+ * - EEXIST - a TQS variable with the same name already exists
+ * - ENOMEM - no appropriate memory area found in which to create memzone
+ */
+struct rte_tqs * __rte_experimental
+rte_tqs_alloc(const char *name, int socket_id, uint64_t lcore_mask);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * De-allocate all the memory used by a TQS variable. It is the
+ * application's responsibility to make sure that no other thread
+ * is using the TQS variable.
+ *
+ * The TQS variable is removed from RTE_TAILQ_TQS list.
+ *
+ * @param v
+ * TQS variable to free.
+ */
+void __rte_experimental rte_tqs_free(struct rte_tqs *v);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add a worker thread, running on an lcore, to the list of threads
+ * reporting their quiescent state on a TQS variable.
+ *
+ * This is implemented as a lock-free function. It is multi-thread
+ * safe. This API can be called from the worker threads during
+ * initialization. Any ongoing TQS queries may wait for the
+ * status from this registered worker thread.
+ *
+ * @param v
+ * TQS variable
+ * @param lcore_id
+ * Worker thread on this lcore will report its quiescent state on
+ * this TQS variable.
+ * @return
+ * - 0 if registered successfully.
+ * - -EINVAL if the parameters are invalid (debug mode compilation only).
+ */
+int __rte_experimental
+rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Remove a worker thread, running on an lcore, from the list of threads
+ * reporting their quiescent state on a TQS variable.
+ *
+ * This is implemented as a lock-free function. It is multi-thread safe.
+ * This API can be called from the worker threads during shutdown.
+ * Any ongoing TQS queries may stop waiting for the status from this
+ * unregistered worker thread.
+ *
+ * @param v
+ * TQS variable
+ * @param lcore_id
+ * Worker thread on this lcore will stop reporting its quiescent state
+ * on this TQS variable.
+ * @return
+ * - 0 if un-registered successfully.
+ * - -EINVAL if the parameters are invalid (debug mode compilation only).
+ */
+int __rte_experimental
+rte_tqs_unregister_lcore(struct rte_tqs *v, unsigned int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Trigger the worker threads to report the quiescent state
+ * status.
+ *
+ * This is implemented as a lock-free function. It is multi-thread
+ * safe and can be called from the worker threads as well.
+ *
+ * @param v
+ * TQS variable
+ * @param n
+ * Expected number of times the quiescent state is entered
+ * @param t
+ * - If successful, this is the token for this call of the API.
+ * This should be passed to rte_tqs_check API.
+ * @return
+ * - -EINVAL if the parameters are invalid (debug mode compilation only).
+ * - 0 Otherwise and always (non-debug mode compilation).
+ */
+static __rte_always_inline int __rte_experimental
+rte_tqs_start(struct rte_tqs *v, unsigned int n, uint32_t *t)
+{
+ TQS_RETURN_IF_TRUE((v == NULL || t == NULL), -EINVAL);
+
+ /* This store release will ensure that changes to any data
+ * structure are visible to the workers before the token
+ * update is visible.
+ */
+ *t = __atomic_add_fetch(&v->token, n, __ATOMIC_RELEASE);
+
+ return 0;
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Update quiescent state for the worker thread on a lcore.
+ *
+ * This is implemented as a lock-free function. It is multi-thread safe.
+ * All the worker threads registered to report their quiescent state
+ * on the TQS variable must call this API.
+ *
+ * @param v
+ * TQS variable
+ */
+static __rte_always_inline void __rte_experimental
+rte_tqs_update(struct rte_tqs *v, unsigned int lcore_id)
+{
+ uint32_t t;
+
+ TQS_ERR_LOG_IF_TRUE(v == NULL || lcore_id >= RTE_TQS_MAX_LCORE);
+
+ /* Load the token before the worker thread loads any other
+ * (lock-free) data structure. This ensures that updates
+ * to the data structures are visible if the update
+ * to token is visible.
+ */
+ t = __atomic_load_n(&v->token, __ATOMIC_ACQUIRE);
+ if (v->w[lcore_id].cnt != t)
+ v->w[lcore_id].cnt++;
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Checks if all the worker threads have entered the quiescent state
+ * 'n' number of times. 'n' is provided in rte_tqs_start API.
+ *
+ * This is implemented as a lock-free function. It is multi-thread
+ * safe and can be called from the worker threads as well.
+ *
+ * @param v
+ * TQS variable
+ * @param t
+ * Token returned by rte_tqs_start API
+ * @param wait
+ * If true, block till all the worker threads have completed entering
+ * the quiescent state 'n' number of times
+ * @return
+ * - 0 if all worker threads have NOT passed through specified number
+ * of quiescent states.
+ * - 1 if all worker threads have passed through specified number
+ * of quiescent states.
+ * - -EINVAL if the parameters are invalid (debug mode compilation only).
+ */
+static __rte_always_inline int __rte_experimental
+rte_tqs_check(struct rte_tqs *v, uint32_t t, bool wait)
+{
+ uint64_t l;
+ uint64_t lcore_mask;
+
+ TQS_RETURN_IF_TRUE((v == NULL), -EINVAL);
+
+ do {
+ /* Load the current lcore_mask before loading the
+ * worker thread quiescent state counters.
+ */
+ lcore_mask = __atomic_load_n(&v->lcore_mask, __ATOMIC_ACQUIRE);
+
+ while (lcore_mask) {
+ l = __builtin_ctz(lcore_mask);
+ if (v->w[l].cnt != t)
+ break;
+
+ lcore_mask &= ~(1UL << l);
+ }
+
+ if (lcore_mask == 0)
+ return 1;
+
+ rte_pause();
+ } while (wait);
+
+ return 0;
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Search a TQS variable, given its name.
+ *
+ * It is multi-thread safe.
+ *
+ * @param name
+ * The name of the TQS variable.
+ * @return
+ * On success, the pointer to the TQS variable. NULL on
+ * error with rte_errno set appropriately. Possible errno values include:
+ * - EINVAL - invalid input parameters.
+ * - ENOENT - entry not found.
+ */
+struct rte_tqs * __rte_experimental
+rte_tqs_lookup(const char *name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Dump the details of a single TQS variables to a file.
+ *
+ * It is NOT multi-thread safe.
+ *
+ * @param f
+ * A pointer to a file for output
+ * @param tqs
+ * TQS variable
+ */
+void __rte_experimental
+rte_tqs_dump(FILE *f, struct rte_tqs *v);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Dump the details of all the TQS variables to a file.
+ *
+ * It is multi-thread safe.
+ *
+ * @param f
+ * A pointer to a file for output
+ */
+void __rte_experimental
+rte_tqs_list_dump(FILE *f);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_TQS_H_ */
diff --git a/lib/librte_tqs/rte_tqs_version.map b/lib/librte_tqs/rte_tqs_version.map
new file mode 100644
index 000000000..2e4d5c094
--- /dev/null
+++ b/lib/librte_tqs/rte_tqs_version.map
@@ -0,0 +1,16 @@
+EXPERIMENTAL {
+ global:
+
+ rte_tqs_alloc;
+ rte_tqs_free;
+ rte_tqs_register_lcore;
+ rte_tqs_unregister_lcore;
+ rte_tqs_start;
+ rte_tqs_update;
+ rte_tqs_check;
+ rte_tqs_lookup;
+ rte_tqs_list_dump;
+ rte_tqs_dump;
+
+ local: *;
+};
diff --git a/lib/meson.build b/lib/meson.build
index bb7f443f9..ee19c483e 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -21,7 +21,7 @@ libraries = [ 'compat', # just a header, used for versioning
'gro', 'gso', 'ip_frag', 'jobstats',
'kni', 'latencystats', 'lpm', 'member',
'meter', 'power', 'pdump', 'rawdev',
- 'reorder', 'sched', 'security', 'vhost',
+ 'reorder', 'sched', 'security', 'tqs', 'vhost',
# add pkt framework libs which use other libs from above
'port', 'table', 'pipeline',
# flow_classify lib depends on pkt framework table lib
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 3ebc4e64c..6e19e669a 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -92,6 +92,7 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_EAL) += -lrte_eal
_LDLIBS-$(CONFIG_RTE_LIBRTE_CMDLINE) += -lrte_cmdline
_LDLIBS-$(CONFIG_RTE_LIBRTE_REORDER) += -lrte_reorder
_LDLIBS-$(CONFIG_RTE_LIBRTE_SCHED) += -lrte_sched
+_LDLIBS-$(CONFIG_RTE_LIBRTE_TQS) += -lrte_tqs

ifeq ($(CONFIG_RTE_EXEC_ENV_LINUXAPP),y)
_LDLIBS-$(CONFIG_RTE_LIBRTE_KNI) += -lrte_kni
--
2.17.1
Ananyev, Konstantin
2018-11-24 12:18:00 UTC
Hi Honnappa,
Post by Honnappa Nagarahalli
+
+/* Allocate a new TQS variable with the name *name* in memory. */
+struct rte_tqs * __rte_experimental
+rte_tqs_alloc(const char *name, int socket_id, uint64_t lcore_mask)
+{
+ char tqs_name[RTE_TQS_NAMESIZE];
+ struct rte_tailq_entry *te, *tmp_te;
+ struct rte_tqs_list *tqs_list;
+ struct rte_tqs *v, *tmp_v;
+ int ret;
+
+ if (name == NULL) {
+ RTE_LOG(ERR, TQS, "Invalid input parameters\n");
+ rte_errno = -EINVAL;
+ return NULL;
+ }
+
+ te = rte_zmalloc("TQS_TAILQ_ENTRY", sizeof(*te), 0);
+ if (te == NULL) {
+ RTE_LOG(ERR, TQS, "Cannot reserve memory for tailq\n");
+ rte_errno = -ENOMEM;
+ return NULL;
+ }
+
+ snprintf(tqs_name, sizeof(tqs_name), "%s", name);
+ v = rte_zmalloc_socket(tqs_name, sizeof(struct rte_tqs),
+ RTE_CACHE_LINE_SIZE, socket_id);
+ if (v == NULL) {
+ RTE_LOG(ERR, TQS, "Cannot reserve memory for TQS variable\n");
+ rte_errno = -ENOMEM;
+ goto alloc_error;
+ }
+
+ ret = snprintf(v->name, sizeof(v->name), "%s", name);
+ if (ret < 0 || ret >= (int)sizeof(v->name)) {
+ rte_errno = -ENAMETOOLONG;
+ goto alloc_error;
+ }
+
+ te->data = (void *) v;
+ v->lcore_mask = lcore_mask;
+
+ rte_rwlock_write_lock(RTE_EAL_TAILQ_RWLOCK);
+
+ tqs_list = RTE_TAILQ_CAST(rte_tqs_tailq.head, rte_tqs_list);
+
+ /* Search if a TQS variable with the same name exists already */
+ TAILQ_FOREACH(tmp_te, tqs_list, next) {
+ tmp_v = (struct rte_tqs *) tmp_te->data;
+ if (strncmp(name, tmp_v->name, RTE_TQS_NAMESIZE) == 0)
+ break;
+ }
+
+ if (tmp_te != NULL) {
+ rte_errno = -EEXIST;
+ goto tqs_exist;
+ }
+
+ TAILQ_INSERT_TAIL(tqs_list, te, next);
+
+ rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+ return v;
+
+tqs_exist:
+ rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+alloc_error:
+ rte_free(te);
+ rte_free(v);
+ return NULL;
+}
That seems like quite a heavy-weight function just to allocate a sync variable.
As the size of struct rte_tqs is constant and known to the user, it might be better
to just provide rte_tqs_init(struct rte_tqs *tqs, ...) and let the user allocate/free
the memory for it himself.
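A minimal sketch of the suggested init-style API (the struct layout and all names below are hypothetical stand-ins for the patch's types, not the actual proposal):

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define RTE_TQS_NAMESIZE 32
#define RTE_TQS_MAX_LCORE 64

/* Simplified stand-in for the patch's struct rte_tqs. */
struct rte_tqs {
	char name[RTE_TQS_NAMESIZE];
	uint64_t lcore_mask;
	uint32_t token;
	uint32_t cnt[RTE_TQS_MAX_LCORE];
};

/* Init-style API: the caller owns the memory (static, stack, or its
 * own allocation); no tailq registration, no lookup by name. */
static int
rte_tqs_init(struct rte_tqs *v, const char *name, uint64_t lcore_mask)
{
	if (v == NULL || name == NULL)
		return -EINVAL;
	memset(v, 0, sizeof(*v));
	if (snprintf(v->name, sizeof(v->name), "%s", name) >=
			(int)sizeof(v->name))
		return -ENAMETOOLONG;
	v->lcore_mask = lcore_mask;
	return 0;
}
```

With this shape the variable can live in static storage next to the data structure it protects, and freeing becomes the caller's responsibility.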
Post by Honnappa Nagarahalli
+
+/* Add a reader thread, running on an lcore, to the list of threads
+ * reporting their quiescent state on a TQS variable.
+ */
+int __rte_experimental
+rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id)
+{
+ TQS_RETURN_IF_TRUE((v == NULL || lcore_id >= RTE_TQS_MAX_LCORE),
+ -EINVAL);
It is not very good practice to make a function return different values and behave
in a different way in debug/non-debug mode.
I'd say that for slow-path (functions in .c) it is always good to check input parameters.
For fast-path (functions in .h) we sometimes skip such checking,
but debug mode can probably use RTE_ASSERT() or so.


lcore_id >= RTE_TQS_MAX_LCORE

Is this limitation really necessary?
First, it means that only lcores can use that API (at least the data-path part); second,
even today many machines have more than 64 cores.
I think you can easily avoid such a limitation if, instead of requiring lcore_id as an
input parameter, you just make it return the index of the next available entry in w[].
Then tqs_update() can take that index as an input parameter.
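One possible shape for the suggested allocation-based registration (a hypothetical sketch, not code from the patch; fixed at 64 slots here purely for brevity, a real version would scale the bitmap with the w[] array):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Bit i set means slot i of w[] is in use. */
static _Atomic uint64_t tqs_reg_mask;

/* Return the index of the next free slot in w[], or -1 if full.
 * Safe to call concurrently from any thread, lcore or not. */
static int
tqs_register(void)
{
	uint64_t m = atomic_load_explicit(&tqs_reg_mask,
					  memory_order_relaxed);
	while (m != UINT64_MAX) {
		int i = __builtin_ctzll(~m);	/* lowest free slot */
		if (atomic_compare_exchange_weak_explicit(&tqs_reg_mask,
				&m, m | (1ULL << i),
				memory_order_acq_rel, memory_order_relaxed))
			return i;
		/* CAS failure reloads m; retry with the fresh mask. */
	}
	return -1;
}

static void
tqs_unregister(int i)
{
	atomic_fetch_and_explicit(&tqs_reg_mask, ~(1ULL << i),
				  memory_order_release);
}
```

A freed slot is reused by the next registration, so the index space stays bounded by the number of concurrently registered threads.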
Post by Honnappa Nagarahalli
+
+ /* Worker thread has to count the quiescent states
+ * only from the current value of token.
+ */
+ v->w[lcore_id].cnt = v->token;
Wonder what would happen if a new reader
calls register() after the writer calls start()?
Looks like a race condition.
Or is such a pattern not supported?
Post by Honnappa Nagarahalli
+
+ /* Release the store to initial TQS count so that workers
+ * can use it immediately after this function returns.
+ */
+ __atomic_fetch_or(&v->lcore_mask, (1UL << lcore_id), __ATOMIC_RELEASE);
+
+ return 0;
+}
+
+/**
+ *
+ * Trigger the worker threads to report the quiescent state
+ * status.
+ *
+ * This is implemented as a lock-free function. It is multi-thread
+ * safe and can be called from the worker threads as well.
+ *
+ * TQS variable
+ * Expected number of times the quiescent state is entered
+ * - If successful, this is the token for this call of the API.
+ * This should be passed to rte_tqs_check API.
+ * - -EINVAL if the parameters are invalid (debug mode compilation only).
+ * - 0 Otherwise and always (non-debug mode compilation).
+ */
+static __rte_always_inline int __rte_experimental
+rte_tqs_start(struct rte_tqs *v, unsigned int n, uint32_t *t)
+{
+ TQS_RETURN_IF_TRUE((v == NULL || t == NULL), -EINVAL);
+
+ /* This store release will ensure that changes to any data
+ * structure are visible to the workers before the token
+ * update is visible.
+ */
+ *t = __atomic_add_fetch(&v->token, n, __ATOMIC_RELEASE);
+
+ return 0;
+}
+
+/**
+ *
+ * Update quiescent state for the worker thread on a lcore.
+ *
+ * This is implemented as a lock-free function. It is multi-thread safe.
+ * All the worker threads registered to report their quiescent state
+ * on the TQS variable must call this API.
+ *
+ * TQS variable
+ */
+static __rte_always_inline void __rte_experimental
+rte_tqs_update(struct rte_tqs *v, unsigned int lcore_id)
+{
+ uint32_t t;
+
+ TQS_ERR_LOG_IF_TRUE(v == NULL || lcore_id >= RTE_TQS_MAX_LCORE);
+
+ /* Load the token before the worker thread loads any other
+ * (lock-free) data structure. This ensures that updates
+ * to the data structures are visible if the update
+ * to token is visible.
+ */
+ t = __atomic_load_n(&v->token, __ATOMIC_ACQUIRE);
Hmm, I am not very familiar with C11 model, but it looks like a race
condition to me:
as I understand, update() is supposed to be called at the end of the reader's
critical section, correct?
But ACQUIRE is only a hoist barrier, which means compiler and cpu
are free to move earlier reads (and writes) after it.
It probably needs to be a full ACQ_REL here.
Post by Honnappa Nagarahalli
+ if (v->w[lcore_id].cnt != t)
+ v->w[lcore_id].cnt++;
+}
+
+/**
+ *
+ * Checks if all the worker threads have entered the quiescent state
+ * 'n' number of times. 'n' is provided in rte_tqs_start API.
+ *
+ * This is implemented as a lock-free function. It is multi-thread
+ * safe and can be called from the worker threads as well.
+ *
+ * TQS variable
+ * Token returned by rte_tqs_start API
+ * If true, block till all the worker threads have completed entering
+ * the quiescent state 'n' number of times
+ * - 0 if all worker threads have NOT passed through specified number
+ * of quiescent states.
+ * - 1 if all worker threads have passed through specified number
+ * of quiescent states.
+ * - -EINVAL if the parameters are invalid (debug mode compilation only).
+ */
+static __rte_always_inline int __rte_experimental
+rte_tqs_check(struct rte_tqs *v, uint32_t t, bool wait)
+{
+ uint64_t l;
+ uint64_t lcore_mask;
+
+ TQS_RETURN_IF_TRUE((v == NULL), -EINVAL);
+
+ do {
+ /* Load the current lcore_mask before loading the
+ * worker thread quiescent state counters.
+ */
+ lcore_mask = __atomic_load_n(&v->lcore_mask, __ATOMIC_ACQUIRE);
What would happen if a reader calls unregister() simultaneously
with check() and updates lcore_mask straight after that load?
As I understand, check() might hang in such a case.
Post by Honnappa Nagarahalli
+
+ while (lcore_mask) {
+ l = __builtin_ctz(lcore_mask);
+ if (v->w[l].cnt != t)
+ break;
As I understand, that makes control-path function progress dependent
on simultaneous invocation of data-path functions.
In some cases that might cause the control-path to hang.
Let's say the data-path function wouldn't be called, or the user invokes
control-path and data-path functions from the same thread.
Post by Honnappa Nagarahalli
+
+ lcore_mask &= ~(1UL << l);
+ }
+
+ if (lcore_mask == 0)
+ return 1;
+
+ rte_pause();
+ } while (wait);
+
+ return 0;
+}
+
Honnappa Nagarahalli
2018-11-27 21:32:04 UTC
Post by Ananyev, Konstantin
Hi Honnappa,
Thank you for reviewing the patch, appreciate your comments.
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+
+/* Allocate a new TQS variable with the name *name* in memory. */
+struct rte_tqs * __rte_experimental rte_tqs_alloc(const char *name,
+int socket_id, uint64_t lcore_mask) {
+ char tqs_name[RTE_TQS_NAMESIZE];
+ struct rte_tailq_entry *te, *tmp_te;
+ struct rte_tqs_list *tqs_list;
+ struct rte_tqs *v, *tmp_v;
+ int ret;
+
+ if (name == NULL) {
+ RTE_LOG(ERR, TQS, "Invalid input parameters\n");
+ rte_errno = -EINVAL;
+ return NULL;
+ }
+
+ te = rte_zmalloc("TQS_TAILQ_ENTRY", sizeof(*te), 0);
+ if (te == NULL) {
+ RTE_LOG(ERR, TQS, "Cannot reserve memory for tailq\n");
+ rte_errno = -ENOMEM;
+ return NULL;
+ }
+
+ snprintf(tqs_name, sizeof(tqs_name), "%s", name);
+ v = rte_zmalloc_socket(tqs_name, sizeof(struct rte_tqs),
+ RTE_CACHE_LINE_SIZE, socket_id);
+ if (v == NULL) {
+ RTE_LOG(ERR, TQS, "Cannot reserve memory for TQS
variable\n");
Post by Honnappa Nagarahalli
+ rte_errno = -ENOMEM;
+ goto alloc_error;
+ }
+
+ ret = snprintf(v->name, sizeof(v->name), "%s", name);
+ if (ret < 0 || ret >= (int)sizeof(v->name)) {
+ rte_errno = -ENAMETOOLONG;
+ goto alloc_error;
+ }
+
+ te->data = (void *) v;
+ v->lcore_mask = lcore_mask;
+
+ rte_rwlock_write_lock(RTE_EAL_TAILQ_RWLOCK);
+
+ tqs_list = RTE_TAILQ_CAST(rte_tqs_tailq.head, rte_tqs_list);
+
+ /* Search if a TQS variable with the same name exists already */
+ TAILQ_FOREACH(tmp_te, tqs_list, next) {
+ tmp_v = (struct rte_tqs *) tmp_te->data;
+ if (strncmp(name, tmp_v->name, RTE_TQS_NAMESIZE) == 0)
+ break;
+ }
+
+ if (tmp_te != NULL) {
+ rte_errno = -EEXIST;
+ goto tqs_exist;
+ }
+
+ TAILQ_INSERT_TAIL(tqs_list, te, next);
+
+ rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+ return v;
+
+tqs_exist:
+ rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+alloc_error:
+ rte_free(te);
+ rte_free(v);
+ return NULL;
+}
That seems quite heavy-weight function just to allocate sync variable.
As size of struct rte_tqs is constant and known to the user, might be better just
provide rte_tqs_init(struct rte_tqs *tqs, ...) and let user allocate/free memory
for it by himself.
I believe, when you say heavy-weight, you are referring to adding tqs variable to the TAILQ and allocating the memory for it. Agree. I also do not expect that there are a whole lot of tqs variables used in an application. Even in rte_tqs_free, there is similar overhead.

The extra part is due to the way the TQS variable will get identified by data plane threads. I am thinking that a data plane thread will use the rte_tqs_lookup API to identify a TQS variable. However, it is possible to share this with data plane threads via a simple shared structure as well.

Along with not allocating the memory, are you suggesting that we could skip maintaining a list of TQS variables in the TAILQ? This will remove rte_tqs_lookup, rte_tqs_free, rte_tqs_list_dump APIs. I am fine with this approach.
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+
+/* Add a reader thread, running on an lcore, to the list of threads
+ * reporting their quiescent state on a TQS variable.
+ */
+int __rte_experimental
+rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {
+ TQS_RETURN_IF_TRUE((v == NULL || lcore_id >=
RTE_TQS_MAX_LCORE),
Post by Honnappa Nagarahalli
+ -EINVAL);
It is not very good practice to make function return different values and behave
in a different way in debug/non-debug mode.
I'd say that for slow-path (functions in .c) it is always good to check input parameters.
For fast-path (functions in .h) we sometimes skip such checking, but debug
mode can probably use RTE_ASSERT() or so.
Makes sense, I will change this in the next version.
Post by Ananyev, Konstantin
lcore_id >= RTE_TQS_MAX_LCORE
Is this limitation really necessary?
I added this limitation because currently a DPDK application cannot take a mask more than 64 bits wide. Otherwise, this should be as big as RTE_MAX_LCORE.
I see that in the case of the '-lcores' option, the number of lcores can be more than the number of PEs. In this case, we still need a MAX limit (but it can be bigger than 64).
Post by Ananyev, Konstantin
First it means that only lcores can use that API (at least data-path part), second
even today many machines have more than 64 cores.
I think you can easily avoid such limitation, if instead of requiring lcore_id as
input parameter, you'll just make it return index of next available entry in w[].
Then tqs_update() can take that index as input parameter.
I had thought about a similar approach based on IDs. I was concerned that ID will be one more thing to manage for the application. But, I see the limitations of the current approach now. I will change it to allocation based. This will support even non-EAL pthreads as well.
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+
+ /* Worker thread has to count the quiescent states
+ * only from the current value of token.
+ */
+ v->w[lcore_id].cnt = v->token;
Wonder what would happen, if new reader will call register(), after writer calls
start()?
Looks like a race-condition.
Or such pattern is not supported?
The start should be called only after the reference to the entry in the data structure is 'deleted'. Hence the new reader will not get the reference to the deleted entry and does not have to increment its counter. When rte_tqs_check is called, it will see that the counter is already up to date. (I am missing a load-acquire on the token, I will correct that in the next version).
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+
+ /* Release the store to initial TQS count so that workers
+ * can use it immediately after this function returns.
+ */
+ __atomic_fetch_or(&v->lcore_mask, (1UL << lcore_id),
+__ATOMIC_RELEASE);
+
+ return 0;
+}
+
+/**
+ *
+ * Trigger the worker threads to report the quiescent state
+ * status.
+ *
+ * This is implemented as a lock-free function. It is multi-thread
+ * safe and can be called from the worker threads as well.
+ *
+ * TQS variable
+ * Expected number of times the quiescent state is entered
+ * - If successful, this is the token for this call of the API.
+ * This should be passed to rte_tqs_check API.
+ * - -EINVAL if the parameters are invalid (debug mode compilation only).
+ * - 0 Otherwise and always (non-debug mode compilation).
+ */
+static __rte_always_inline int __rte_experimental
+rte_tqs_start(struct rte_tqs *v, unsigned int n, uint32_t *t) {
+ TQS_RETURN_IF_TRUE((v == NULL || t == NULL), -EINVAL);
+
+ /* This store release will ensure that changes to any data
+ * structure are visible to the workers before the token
+ * update is visible.
+ */
+ *t = __atomic_add_fetch(&v->token, n, __ATOMIC_RELEASE);
+
+ return 0;
+}
+
+/**
+ *
+ * Update quiescent state for the worker thread on a lcore.
+ *
+ * This is implemented as a lock-free function. It is multi-thread safe.
+ * All the worker threads registered to report their quiescent state
+ * on the TQS variable must call this API.
+ *
+ * TQS variable
+ */
+static __rte_always_inline void __rte_experimental
+rte_tqs_update(struct rte_tqs *v, unsigned int lcore_id) {
+ uint32_t t;
+
+ TQS_ERR_LOG_IF_TRUE(v == NULL || lcore_id >=
RTE_TQS_MAX_LCORE);
Post by Honnappa Nagarahalli
+
+ /* Load the token before the worker thread loads any other
+ * (lock-free) data structure. This ensures that updates
+ * to the data structures are visible if the update
+ * to token is visible.
+ */
+ t = __atomic_load_n(&v->token, __ATOMIC_ACQUIRE);
Hmm, I am not very familiar with C11 model, but it looks like a race condition
as I understand, update() supposed be called at the end of reader's critical
section, correct?
Yes, the understanding is correct.
Post by Ananyev, Konstantin
But ACQUIRE is only a hoist barrier, which means compiler and cpu are free to
move earlier reads (and writes) after it.
Yes, your understanding is correct.
Post by Ananyev, Konstantin
It probably needs to be a full ACQ_REL here.
The sequence of operations is as follows:
1) Writer 'deletes' an entry from a lock-free data structure
2) Writer calls rte_tqs_start - This API increments the 'token' and does a store-release. So, any earlier stores would be visible if the store to 'token' is visible (to the data plane threads).
3) Reader calls rte_tqs_update - This API load-acquires the 'token'.
a) If this 'token' is the updated value from 2) then the entry deleted from 1) will not be available for the reader to reference (even if that reference is due to earlier reads being moved after load-acquire of 'token').
b) If this 'token' is not the updated value from 2) then the entry deleted from 1) may or may not be available for the reader to reference. In this case the w[lcore_id].cnt is not updated, hence the writer will wait to 'free' the deleted entry from 1)
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+ if (v->w[lcore_id].cnt != t)
+ v->w[lcore_id].cnt++;
+}
+
+/**
+ *
+ * Checks if all the worker threads have entered the quiescent state
+ * 'n' number of times. 'n' is provided in rte_tqs_start API.
+ *
+ * This is implemented as a lock-free function. It is multi-thread
+ * safe and can be called from the worker threads as well.
+ *
+ * TQS variable
+ * Token returned by rte_tqs_start API
+ * If true, block till all the worker threads have completed entering
+ * the quiescent state 'n' number of times
+ * - 0 if all worker threads have NOT passed through specified number
+ * of quiescent states.
+ * - 1 if all worker threads have passed through specified number
+ * of quiescent states.
+ * - -EINVAL if the parameters are invalid (debug mode compilation only).
+ */
+static __rte_always_inline int __rte_experimental
+rte_tqs_check(struct rte_tqs *v, uint32_t t, bool wait) {
+ uint64_t l;
+ uint64_t lcore_mask;
+
+ TQS_RETURN_IF_TRUE((v == NULL), -EINVAL);
+
+ do {
+ /* Load the current lcore_mask before loading the
+ * worker thread quiescent state counters.
+ */
+ lcore_mask = __atomic_load_n(&v->lcore_mask,
__ATOMIC_ACQUIRE);
What would happen if reader will call unregister() simultaneously with check()
and will update lcore_mask straight after that load?
As I understand check() might hang in such case.
If the 'lcore_mask' is updated after this load, it will affect only the current iteration of the while loop below. In the next iteration the 'lcore_mask' is loaded again.
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+
+ while (lcore_mask) {
+ l = __builtin_ctz(lcore_mask);
+ if (v->w[l].cnt != t)
+ break;
As I understand, that makes control-path function progress dependent on
simultaneous invocation of data-path functions.
I agree that the control-path function progress (for ex: how long to wait for freeing the memory) depends on invocation of the data-path functions. The separation of 'start', 'check' and the option not to block in 'check' provide the flexibility for control-path to do some other work if it chooses to.
Post by Ananyev, Konstantin
In some cases that might cause control-path to hang.
Let say if data-path function wouldn't be called, or user invokes control-path
and data-path functions from the same thread.
I agree with the case of the data-path function not getting called. I would consider that a programming error. I can document that warning in the rte_tqs_check API.

In the case of the same thread calling both control-path and data-path functions, it would depend on the sequence of the calls. The following sequence should not cause any hangs:
Worker thread
1) 'deletes' an entry from a lock-free data structure
2) rte_tqs_start
3) rte_tqs_update
4) rte_tqs_check (wait == 1 or wait == 0)
5) 'free' the entry deleted in 1)

If 3) and 4) are interchanged, then there will be a hang if wait is set to 1. If wait is set to 0, there should not be a hang.
I can document this as part of the documentation (I do not think API documentation is required for this).
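The worker-thread sequence above can be sketched as a single-threaded demo, with C11 atomics standing in for the GCC __atomic builtins (simplified stand-ins, not the patch's code: one reader slot, names hypothetical):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct tqs {
	_Atomic uint32_t token;
	_Atomic uint64_t lcore_mask;
	_Atomic uint32_t cnt[64];
};

/* Step 2: the writer bumps the token with release semantics, so the
 * preceding 'delete' store is visible before the new token is. */
static uint32_t
tqs_start(struct tqs *v, unsigned int n)
{
	return atomic_fetch_add_explicit(&v->token, n,
					 memory_order_release) + n;
}

/* Step 3: the reader load-acquires the token and reports one
 * quiescent state if it is behind. */
static void
tqs_update(struct tqs *v, unsigned int id)
{
	uint32_t t = atomic_load_explicit(&v->token, memory_order_acquire);
	if (atomic_load_explicit(&v->cnt[id], memory_order_relaxed) != t)
		atomic_fetch_add_explicit(&v->cnt[id], 1,
					  memory_order_relaxed);
}

/* Step 4: returns 1 once every registered reader has reached token t. */
static int
tqs_check(struct tqs *v, uint32_t t, bool wait)
{
	do {
		uint64_t mask = atomic_load_explicit(&v->lcore_mask,
						     memory_order_acquire);
		while (mask) {
			int l = __builtin_ctzll(mask);
			if (atomic_load_explicit(&v->cnt[l],
					memory_order_relaxed) != t)
				break;
			mask &= ~(1ULL << l);
		}
		if (mask == 0)
			return 1;
	} while (wait);
	return 0;
}
```

Calling check() with wait == 0 before update(), as in the interchanged-order case above, simply returns 0 instead of hanging.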
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+
+ lcore_mask &= ~(1UL << l);
+ }
+
+ if (lcore_mask == 0)
+ return 1;
+
+ rte_pause();
+ } while (wait);
+
+ return 0;
+}
+
Ananyev, Konstantin
2018-11-28 15:25:21 UTC
Post by Honnappa Nagarahalli
Post by Ananyev, Konstantin
Hi Honnappa,
Thank you for reviewing the patch, appreciate your comments.
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+
+/* Allocate a new TQS variable with the name *name* in memory. */
+struct rte_tqs * __rte_experimental rte_tqs_alloc(const char *name,
+int socket_id, uint64_t lcore_mask) {
+ char tqs_name[RTE_TQS_NAMESIZE];
+ struct rte_tailq_entry *te, *tmp_te;
+ struct rte_tqs_list *tqs_list;
+ struct rte_tqs *v, *tmp_v;
+ int ret;
+
+ if (name == NULL) {
+ RTE_LOG(ERR, TQS, "Invalid input parameters\n");
+ rte_errno = -EINVAL;
+ return NULL;
+ }
+
+ te = rte_zmalloc("TQS_TAILQ_ENTRY", sizeof(*te), 0);
+ if (te == NULL) {
+ RTE_LOG(ERR, TQS, "Cannot reserve memory for tailq\n");
+ rte_errno = -ENOMEM;
+ return NULL;
+ }
+
+ snprintf(tqs_name, sizeof(tqs_name), "%s", name);
+ v = rte_zmalloc_socket(tqs_name, sizeof(struct rte_tqs),
+ RTE_CACHE_LINE_SIZE, socket_id);
+ if (v == NULL) {
+ RTE_LOG(ERR, TQS, "Cannot reserve memory for TQS
variable\n");
Post by Honnappa Nagarahalli
+ rte_errno = -ENOMEM;
+ goto alloc_error;
+ }
+
+ ret = snprintf(v->name, sizeof(v->name), "%s", name);
+ if (ret < 0 || ret >= (int)sizeof(v->name)) {
+ rte_errno = -ENAMETOOLONG;
+ goto alloc_error;
+ }
+
+ te->data = (void *) v;
+ v->lcore_mask = lcore_mask;
+
+ rte_rwlock_write_lock(RTE_EAL_TAILQ_RWLOCK);
+
+ tqs_list = RTE_TAILQ_CAST(rte_tqs_tailq.head, rte_tqs_list);
+
+ /* Search if a TQS variable with the same name exists already */
+ TAILQ_FOREACH(tmp_te, tqs_list, next) {
+ tmp_v = (struct rte_tqs *) tmp_te->data;
+ if (strncmp(name, tmp_v->name, RTE_TQS_NAMESIZE) == 0)
+ break;
+ }
+
+ if (tmp_te != NULL) {
+ rte_errno = -EEXIST;
+ goto tqs_exist;
+ }
+
+ TAILQ_INSERT_TAIL(tqs_list, te, next);
+
+ rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+ return v;
+
+tqs_exist:
+ rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+alloc_error:
+ rte_free(te);
+ rte_free(v);
+ return NULL;
+}
That seems quite heavy-weight function just to allocate sync variable.
As size of struct rte_tqs is constant and known to the user, might be better just
provide rte_tqs_init(struct rte_tqs *tqs, ...) and let user allocate/free memory
for it by himself.
I believe, when you say heavy-weight, you are referring to adding tqs variable to the TAILQ and allocating the memory for it.
Yes.
Post by Honnappa Nagarahalli
Agree. I also
do not expect that there are a whole lot of tqs variables used in an application. Even in rte_tqs_free, there is similar overhead.
The extra part is due to the way the TQS variable will get identified by data plane threads. I am thinking that a data plane thread will use the
rte_tqs_lookup API to identify a TQS variable. However, it is possible to share this with data plane threads via a simple shared structure as
well.
Along with not allocating the memory, are you suggesting that we could skip maintaining a list of TQS variables in the TAILQ? This will
remove rte_tqs_lookup, rte_tqs_free, rte_tqs_list_dump APIs. I am fine with this approach.
Yes, that's what I suggest.
My thought was - it is just another data structure used for synchronization (like spinlock, rwlock, etc.).
So it should be possible to allocate it statically, and we probably don't need the ability to look up
such a variable by name via the tailq.
Post by Honnappa Nagarahalli
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+
+/* Add a reader thread, running on an lcore, to the list of threads
+ * reporting their quiescent state on a TQS variable.
+ */
+int __rte_experimental
+rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {
+ TQS_RETURN_IF_TRUE((v == NULL || lcore_id >=
RTE_TQS_MAX_LCORE),
Post by Honnappa Nagarahalli
+ -EINVAL);
It is not very good practice to make function return different values and behave
in a different way in debug/non-debug mode.
I'd say that for slow-path (functions in .c) it is always good to check input
parameters.
For fast-path (functions in .h) we sometimes skip such checking, but debug
mode can probably use RTE_ASSERT() or so.
Makes sense, I will change this in the next version.
Post by Ananyev, Konstantin
lcore_id >= RTE_TQS_MAX_LCORE
Is this limitation really necessary?
I added this limitation because currently DPDK application cannot take a mask more than 64bit wide. Otherwise, this should be as big as
RTE_MAX_LCORE.
I see that in the case of '-lcores' option, the number of lcores can be more than the number of PEs. In this case, we still need a MAX limit
(but can be bigger than 64).
Post by Ananyev, Konstantin
First it means that only lcores can use that API (at least data-path part), second
even today many machines have more than 64 cores.
I think you can easily avoid such limitation, if instead of requiring lcore_id as
input parameter, you'll just make it return index of next available entry in w[].
Then tqs_update() can take that index as input parameter.
I had thought about a similar approach based on IDs. I was concerned that ID will be one more thing to manage for the application. But, I
see the limitations of the current approach now. I will change it to allocation based. This will support even non-EAL pthreads as well.
Yes, with such approach non-lcore threads will be able to use it also.
Post by Honnappa Nagarahalli
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+
+ /* Worker thread has to count the quiescent states
+ * only from the current value of token.
+ */
+ v->w[lcore_id].cnt = v->token;
Wonder what would happen, if new reader will call register(), after writer calls
start()?
Looks like a race-condition.
Or such pattern is not supported?
The start should be called only after the reference to the entry in the data structure is 'deleted'. Hence the new reader will not get the
reference to the deleted entry and does not have to increment its counter. When rte_tqs_check is called, it will see that the counter is
already up to date. (I am missing a load-acquire on the token, I will correct that in the next version).
Yes, with _acquire_ in place it seems to be good here.
Post by Honnappa Nagarahalli
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+
+ /* Release the store to initial TQS count so that workers
+ * can use it immediately after this function returns.
+ */
+ __atomic_fetch_or(&v->lcore_mask, (1UL << lcore_id),
+__ATOMIC_RELEASE);
+
+ return 0;
+}
+
+/**
+ *
+ * Trigger the worker threads to report the quiescent state
+ * status.
+ *
+ * This is implemented as a lock-free function. It is multi-thread
+ * safe and can be called from the worker threads as well.
+ *
+ * TQS variable
+ * Expected number of times the quiescent state is entered
+ * - If successful, this is the token for this call of the API.
+ * This should be passed to rte_tqs_check API.
+ * - -EINVAL if the parameters are invalid (debug mode compilation only).
+ * - 0 Otherwise and always (non-debug mode compilation).
+ */
+static __rte_always_inline int __rte_experimental
+rte_tqs_start(struct rte_tqs *v, unsigned int n, uint32_t *t) {
+ TQS_RETURN_IF_TRUE((v == NULL || t == NULL), -EINVAL);
+
+ /* This store release will ensure that changes to any data
+ * structure are visible to the workers before the token
+ * update is visible.
+ */
+ *t = __atomic_add_fetch(&v->token, n, __ATOMIC_RELEASE);
+
+ return 0;
+}
+
+/**
+ *
+ * Update quiescent state for the worker thread on a lcore.
+ *
+ * This is implemented as a lock-free function. It is multi-thread safe.
+ * All the worker threads registered to report their quiescent state
+ * on the TQS variable must call this API.
+ *
+ * TQS variable
+ */
+static __rte_always_inline void __rte_experimental
+rte_tqs_update(struct rte_tqs *v, unsigned int lcore_id) {
+ uint32_t t;
+
+ TQS_ERR_LOG_IF_TRUE(v == NULL || lcore_id >=
RTE_TQS_MAX_LCORE);
Post by Honnappa Nagarahalli
+
+ /* Load the token before the worker thread loads any other
+ * (lock-free) data structure. This ensures that updates
+ * to the data structures are visible if the update
+ * to token is visible.
+ */
+ t = __atomic_load_n(&v->token, __ATOMIC_ACQUIRE);
Hmm, I am not very familiar with C11 model, but it looks like a race condition
as I understand, update() supposed be called at the end of reader's critical
section, correct?
Yes, the understanding is correct.
Post by Ananyev, Konstantin
But ACQUIRE is only a hoist barrier, which means compiler and cpu are free to
move earlier reads (and writes) after it.
Yes, your understanding is correct.
Post by Ananyev, Konstantin
It probably needs to be a full ACQ_REL here.
1) Writer 'deletes' an entry from a lock-free data structure
2) Writer calls rte_tqs_start - This API increments the 'token' and does a store-release. So, any earlier stores would be visible if the store to
'token' is visible (to the data plane threads).
3) Reader calls rte_tqs_update - This API load-acquires the 'token'.
a) If this 'token' is the updated value from 2) then the entry deleted from 1) will not be available for the reader to reference (even if
that reference is due to earlier reads being moved after load-acquire of 'token').
b) If this 'token' is not the updated value from 2) then the entry deleted from 1) may or may not be available for the reader to
reference. In this case the w[lcore_id].cnt is not updated, hence the writer will wait to 'free' the deleted entry from 1)
Yes, you're right, it's me being confused.
Post by Honnappa Nagarahalli
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+ if (v->w[lcore_id].cnt != t)
+ v->w[lcore_id].cnt++;
+}
+
+/**
+ *
+ * Checks if all the worker threads have entered the quiescent state
+ * 'n' number of times. 'n' is provided in rte_tqs_start API.
+ *
+ * This is implemented as a lock-free function. It is multi-thread
+ * safe and can be called from the worker threads as well.
+ *
+ * TQS variable
+ * Token returned by rte_tqs_start API
+ * If true, block till all the worker threads have completed entering
+ * the quiescent state 'n' number of times
+ * - 0 if all worker threads have NOT passed through specified number
+ * of quiescent states.
+ * - 1 if all worker threads have passed through specified number
+ * of quiescent states.
+ * - -EINVAL if the parameters are invalid (debug mode compilation only).
+ */
+static __rte_always_inline int __rte_experimental
+rte_tqs_check(struct rte_tqs *v, uint32_t t, bool wait) {
+ uint64_t l;
+ uint64_t lcore_mask;
+
+ TQS_RETURN_IF_TRUE((v == NULL), -EINVAL);
+
+ do {
+ /* Load the current lcore_mask before loading the
+ * worker thread quiescent state counters.
+ */
+ lcore_mask = __atomic_load_n(&v->lcore_mask,
__ATOMIC_ACQUIRE);
What would happen if reader will call unregister() simultaneously with check()
and will update lcore_mask straight after that load?
As I understand check() might hang in such case.
If the 'lcore_mask' is updated after this load, it will affect only the current iteration of the while loop below. In the next iteration the
'lcore_mask' is loaded again.
True, my confusion again.
Post by Honnappa Nagarahalli
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+
+ while (lcore_mask) {
+ l = __builtin_ctz(lcore_mask);
+ if (v->w[l].cnt != t)
+ break;
As I understand, that makes control-path function progress dependent on
simultaneous invocation of data-path functions.
I agree that the control-path function progress (for ex: how long to wait for freeing the memory) depends on invocation of the data-path
functions. The separation of 'start', 'check' and the option not to block in 'check' provide the flexibility for control-path to do some other
work if it chooses to.
Post by Ananyev, Konstantin
In some cases that might cause control-path to hang.
Let say if data-path function wouldn't be called, or user invokes control-path
and data-path functions from the same thread.
I agree with the case of data-path function not getting called. I would consider that as programming error. I can document that warning in
the rte_tqs_check API.
Sure, it can be documented.
Though that means that each data-path thread would have to do an explicit update() call
for every tqs it might use.
I just think that it would complicate things and might limit usage of the library quite significantly.
Post by Honnappa Nagarahalli
In the case of same thread calling both control-path and data-path functions, it would depend on the sequence of the calls. The following
Worker thread
1) 'deletes' an entry from a lock-free data structure
2) rte_tqs_start
3) rte_tqs_update
4) rte_tqs_check (wait == 1 or wait == 0)
5) 'free' the entry deleted in 1)
That's an interesting idea, and that should help, I think.
Probably worth having the {2,3,4} sequence as a new high-level function.
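One possible shape for that combined helper (a hypothetical name and sketch, assuming the caller is itself a registered reader; reporting the caller's own quiescent state up front is what keeps the blocking wait from depending on the caller's own data-path call):

```c
#include <stdatomic.h>
#include <stdint.h>

struct tqs {
	_Atomic uint32_t token;
	_Atomic uint64_t lcore_mask;
	_Atomic uint32_t cnt[64];
};

/* start + self-update + blocking check in one call. Pass self < 0
 * if the caller is not a registered reader. */
static void
tqs_synchronize(struct tqs *v, int self)
{
	uint32_t t = atomic_fetch_add_explicit(&v->token, 1,
					       memory_order_release) + 1;
	if (self >= 0)	/* a caller that is quiescent can jump ahead */
		atomic_store_explicit(&v->cnt[self], t,
				      memory_order_relaxed);
	for (;;) {
		uint64_t mask = atomic_load_explicit(&v->lcore_mask,
						     memory_order_acquire);
		while (mask) {
			int l = __builtin_ctzll(mask);
			if (atomic_load_explicit(&v->cnt[l],
					memory_order_relaxed) != t)
				break;
			mask &= ~(1ULL << l);
		}
		if (mask == 0)
			return;
	}
}
```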
Post by Honnappa Nagarahalli
If 3) and 4) are interchanged, then there will be a hang if wait is set to 1. If wait is set to 0, there should not be a hang.
I can document this as part of the documentation (I do not think API documentation is required for this).
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+
+ lcore_mask &= ~(1UL << l);
+ }
+
+ if (lcore_mask == 0)
+ return 1;
+
+ rte_pause();
+ } while (wait);
+
+ return 0;
+}
+
Honnappa Nagarahalli
2018-12-07 07:27:16 UTC
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
+
+/* Add a reader thread, running on an lcore, to the list of
+threads
+ * reporting their quiescent state on a TQS variable.
+ */
+int __rte_experimental
+rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {
+ TQS_RETURN_IF_TRUE((v == NULL || lcore_id >=
RTE_TQS_MAX_LCORE),
Post by Honnappa Nagarahalli
+ -EINVAL);
It is not very good practice to make function return different
values and behave in a different way in debug/non-debug mode.
I'd say that for slow-path (functions in .c) it is always good to
check input parameters.
For fast-path (functions in .h) we sometimes skip such checking, but
debug mode can probably use RTE_ASSERT() or so.
Makes sense, I will change this in the next version.
Post by Honnappa Nagarahalli
lcore_id >= RTE_TQS_MAX_LCORE
Is this limitation really necessary?
I added this limitation because currently DPDK application cannot take
a mask more than 64bit wide. Otherwise, this should be as big as
RTE_MAX_LCORE.
Post by Honnappa Nagarahalli
I see that in the case of '-lcores' option, the number of lcores can
be more than the number of PEs. In this case, we still need a MAX limit (but
can be bigger than 64).
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
First it means that only lcores can use that API (at least data-path
part), second even today many machines have more than 64 cores.
I think you can easily avoid such limitation, if instead of
requiring lcore_id as input parameter, you'll just make it return index of
next available entry in w[].
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
Then tqs_update() can take that index as input parameter.
I had thought about a similar approach based on IDs. I was concerned
that ID will be one more thing to manage for the application. But, I see the
limitations of the current approach now. I will change it to allocation based.
This will support even non-EAL pthreads as well.
Yes, with such approach non-lcore threads will be able to use it also.
I realized that rte_tqs_register_lcore/rte_tqs_unregister_lcore need to be efficient, as they can be called from the worker's packet processing loop (rte_event_dequeue_burst allows blocking, so the worker thread needs to call rte_tqs_unregister_lcore before calling rte_event_dequeue_burst and rte_tqs_register_lcore before starting packet processing). Allocating the thread ID in these functions will make them more complex.

I suggest that we change the argument 'lcore_id' to 'thread_id'. The application could use 'lcore_id' as 'thread_id' if threads are mapped to physical cores 1:1.

If the threads are not mapped 1:1 to physical cores, the threads need to use a thread_id in the range of 0 - RTE_TQS_MAX_THREADS. I do not see that DPDK has a thread_id concept. For TQS, the thread IDs are global (i.e. not per TQS variable). I could provide APIs to do the thread ID allocation, but I think the thread ID allocation should not be part of this library. Such thread ID might be useful for other libraries.

<snip>
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
+
+ while (lcore_mask) {
+ l = __builtin_ctz(lcore_mask);
+ if (v->w[l].cnt != t)
+ break;
As I understand, that makes control-path function progress dependent
on simultaneous invocation of data-path functions.
I agree that the control-path function progress (for ex: how long to
wait for freeing the memory) depends on invocation of the data-path
functions. The separation of 'start', 'check' and the option not to block in
'check' provide the flexibility for control-path to do some other work if it
chooses to.
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
In some cases that might cause control-path to hang.
Let say if data-path function wouldn't be called, or user invokes
control-path and data-path functions from the same thread.
I agree with the case of data-path function not getting called. I
would consider that as programming error. I can document that warning in
the rte_tqs_check API.
Sure, it can be documented.
Though that means, that each data-path thread would have to do explicit
update() call for every tqs it might use.
I just think that it would complicate things and might limit usage of the library
quite significantly.
Each data path thread has to report its quiescent state. Hence, each data-path thread has to call update() (similar to how rte_timer_manage() has to be called periodically on the worker thread).
Do you have any particular use case in mind where this fails?
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
In the case of the same thread calling both control-path and data-path
functions, it would depend on the sequence of the calls. The following sequence should work:
Worker thread:
1) 'deletes' an entry from a lock-free data structure
2) rte_tqs_start
3) rte_tqs_update
4) rte_tqs_check (wait == 1 or wait == 0)
5) 'free' the entry deleted in 1)
That's an interesting idea, and it should help, I think.
Probably worth having the {2,3,4} sequence as a new high-level function.
Yes, this is a good idea. Such a function would be applicable only in the worker thread. I would prefer to leave it to the application to take care.
Stephen Hemminger
2018-12-07 17:29:36 UTC
On Fri, 7 Dec 2018 07:27:16 +0000
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
+
+/* Add a reader thread, running on an lcore, to the list of
+threads
+ * reporting their quiescent state on a TQS variable.
+ */
+int __rte_experimental
+rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {
+ TQS_RETURN_IF_TRUE((v == NULL || lcore_id >=
RTE_TQS_MAX_LCORE),
Post by Honnappa Nagarahalli
+ -EINVAL);
It is not very good practice to make function return different
values and behave in a different way in debug/non-debug mode.
I'd say that for slow-path (functions in .c) it is always good to
check input parameters.
For fast-path (functions in .h) we sometimes skip such checking, but
debug mode can probably use RTE_ASSERT() or so.
Makes sense, I will change this in the next version.
Post by Honnappa Nagarahalli
lcore_id >= RTE_TQS_MAX_LCORE
Is this limitation really necessary?
I added this limitation because currently DPDK application cannot take
a mask more than 64bit wide. Otherwise, this should be as big as
RTE_MAX_LCORE.
Post by Honnappa Nagarahalli
I see that in the case of '-lcores' option, the number of lcores can
be more than the number of PEs. In this case, we still need a MAX limit (but
can be bigger than 64).
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
First it means that only lcores can use that API (at least data-path
part), second even today many machines have more than 64 cores.
I think you can easily avoid such limitation, if instead of
requiring lcore_id as input parameter, you'll just make it return index of
next available entry in w[].
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
Then tqs_update() can take that index as input parameter.
I had thought about a similar approach based on IDs. I was concerned
that ID will be one more thing to manage for the application. But, I see the
limitations of the current approach now. I will change it to allocation based.
This will support even non-EAL pthreads as well.
Yes, with such approach non-lcore threads will be able to use it also.
I realized that rte_tqs_register_lcore/rte_tqs_unregister_lcore need to be efficient, as they can be called from the worker's packet processing loop (rte_event_dequeue_burst allows blocking, so the worker thread needs to call rte_tqs_unregister_lcore before calling rte_event_dequeue_burst and rte_tqs_register_lcore before starting packet processing). Allocating the thread ID in these functions will make them more complex.
I suggest that we change the argument 'lcore_id' to 'thread_id'. The application could use 'lcore_id' as 'thread_id' if threads are mapped to physical cores 1:1.
If the threads are not mapped 1:1 to physical cores, the threads need to use a thread_id in the range of 0 - RTE_TQS_MAX_THREADS. I do not see that DPDK has a thread_id concept. For TQS, the thread IDs are global (i.e. not per TQS variable). I could provide APIs to do the thread ID allocation, but I think the thread ID allocation should not be part of this library. Such thread ID might be useful for other libraries.
<snip
Thread id is problematic since Glibc doesn't want to give it out.
You have to roll your own function to do gettid().
It is not as easy as just that. Plus what about preemption?

Honnappa Nagarahalli
2018-11-22 03:30:55 UTC
From: Dharmik Thakkar <***@arm.com>

Add API positive/negative test cases and functional tests.

Signed-off-by: Malvika Gupta <***@arm.com>
Signed-off-by: Dharmik Thakkar <***@arm.com>
Signed-off-by: Honnappa Nagarahalli <***@arm.com>
Reviewed-by: Gavin Hu <***@arm.com>
---
test/test/Makefile | 2 +
test/test/autotest_data.py | 6 +
test/test/meson.build | 5 +-
test/test/test_tqs.c | 540 +++++++++++++++++++++++++++++++++++++
4 files changed, 552 insertions(+), 1 deletion(-)
create mode 100644 test/test/test_tqs.c

diff --git a/test/test/Makefile b/test/test/Makefile
index ab4fec34a..7a07039e7 100644
--- a/test/test/Makefile
+++ b/test/test/Makefile
@@ -207,6 +207,8 @@ SRCS-$(CONFIG_RTE_LIBRTE_KVARGS) += test_kvargs.c

SRCS-$(CONFIG_RTE_LIBRTE_BPF) += test_bpf.c

+SRCS-$(CONFIG_RTE_LIBRTE_TQS) += test_tqs.c
+
CFLAGS += -DALLOW_EXPERIMENTAL_API

CFLAGS += -O3
diff --git a/test/test/autotest_data.py b/test/test/autotest_data.py
index 0fb7866db..e676757cd 100644
--- a/test/test/autotest_data.py
+++ b/test/test/autotest_data.py
@@ -676,6 +676,12 @@
"Func": default_autotest,
"Report": None,
},
+ {
+ "Name": "Thread Quiescent State autotest",
+ "Command": "tqs_autotest",
+ "Func": default_autotest,
+ "Report": None,
+ },
#
# Please always make sure that ring_perf is the last test!
#
diff --git a/test/test/meson.build b/test/test/meson.build
index 554e9945f..b80a449ad 100644
--- a/test/test/meson.build
+++ b/test/test/meson.build
@@ -100,6 +100,7 @@ test_sources = files('commands.c',
'test_timer.c',
'test_timer_perf.c',
'test_timer_racecond.c',
+ 'test_tqs.c',
'test_version.c',
'virtual_pmd.c'
)
@@ -122,7 +123,8 @@ test_deps = ['acl',
'port',
'reorder',
'ring',
- 'timer'
+ 'timer',
+ 'tqs'
]

test_names = [
@@ -228,6 +230,7 @@ test_names = [
'timer_autotest',
'timer_perf__autotest',
'timer_racecond_autotest',
+ 'tqs_autotest',
'user_delay_us',
'version_autotest',
]
diff --git a/test/test/test_tqs.c b/test/test/test_tqs.c
new file mode 100644
index 000000000..8633a80a6
--- /dev/null
+++ b/test/test/test_tqs.c
@@ -0,0 +1,540 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2018 Arm Limited
+ */
+
+#include <stdio.h>
+#include <stdbool.h>
+#include <rte_pause.h>
+#include <rte_tqs.h>
+#include <rte_hash.h>
+#include <rte_hash_crc.h>
+#include <rte_malloc.h>
+
+#include "test.h"
+
+/* Check condition and return an error if true. */
+#define TQS_RETURN_IF_ERROR(tqs, cond, str, ...) do { \
+ if (cond) { \
+ printf("ERROR file %s, line %d: " str "\n", __FILE__, \
+ __LINE__, ##__VA_ARGS__); \
+ if (tqs) \
+ rte_tqs_free(tqs); \
+ return -1; \
+ } \
+} while (0)
+
+#define RTE_TQS_MAX_LCORE 64
+uint16_t enabled_core_ids[RTE_TQS_MAX_LCORE];
+uint64_t core_mask;
+uint32_t sw_token;
+uint16_t num_1qs = 1; /* Number of quiescent states = 1 */
+
+struct node {
+ int data;
+ struct node *next;
+};
+
+struct node *head = NULL, *lastNode;
+
+static inline int
+get_enabled_cores_mask(void)
+{
+ uint32_t i;
+ uint16_t core_id;
+ uint32_t max_cores = rte_lcore_count();
+
+ if (max_cores > RTE_TQS_MAX_LCORE) {
+ printf("Number of cores exceed %d\n", RTE_TQS_MAX_LCORE);
+ return -1;
+ }
+
+ core_id = 0;
+ core_mask = 0;
+ i = 0;
+ RTE_LCORE_FOREACH_SLAVE(core_id) {
+ enabled_core_ids[i] = core_id;
+ i++;
+ core_mask |= 1UL << core_id;
+ }
+
+ return 0;
+}
+
+/*
+ * rte_tqs_alloc: Allocate a TQS variable
+ */
+static int
+test_tqs_alloc(void)
+{
+ const char name32[] = "xyzxyzxyzxyzxyzxyzxyzxyzxyzx123";
+ const char name33[] = "xyzxyizxyzxyzxyzxyzxyzxyzxyzxyzxyz";
+ const char name3[] = "xyz";
+ struct rte_tqs *t;
+
+ printf("Test rte_tqs_alloc()\n");
+
+ t = rte_tqs_alloc(NULL, SOCKET_ID_ANY, core_mask);
+ TQS_RETURN_IF_ERROR(t, (t != NULL), "NULL TQS variable");
+
+ t = rte_tqs_alloc(name3, SOCKET_ID_ANY, core_mask);
+ TQS_RETURN_IF_ERROR(t, (t == NULL), "Variable name < %d",
+ RTE_TQS_NAMESIZE);
+ rte_tqs_free(t);
+
+ t = rte_tqs_alloc(name32, SOCKET_ID_ANY, core_mask);
+ TQS_RETURN_IF_ERROR(t, (t == NULL), "Variable name < %d",
+ RTE_TQS_NAMESIZE);
+ rte_tqs_free(t);
+
+ t = rte_tqs_alloc(name33, SOCKET_ID_ANY, core_mask);
+ TQS_RETURN_IF_ERROR(t, (t != NULL), "Variable name > %d",
+ RTE_TQS_NAMESIZE);
+
+ t = rte_tqs_alloc(name3, 0, core_mask);
+ TQS_RETURN_IF_ERROR(t, (t == NULL), "Valid socket ID");
+ rte_tqs_free(t);
+
+ t = rte_tqs_alloc(name3, 10000, core_mask);
+ TQS_RETURN_IF_ERROR(t, (t != NULL), "Invalid socket ID");
+
+ t = rte_tqs_alloc(name3, SOCKET_ID_ANY, 0);
+ TQS_RETURN_IF_ERROR(t, (t == NULL), "0 core mask");
+ rte_tqs_free(t);
+
+ return 0;
+}
+
+/*
+ * rte_tqs_register_lcore: Register threads
+ */
+static int
+test_tqs_register_lcore(void)
+{
+ struct rte_tqs *t;
+ const char *name = "TQS";
+ int ret;
+
+ printf("\nTest rte_tqs_register_lcore()\n");
+
+ t = rte_tqs_alloc(name, SOCKET_ID_ANY, 0);
+ TQS_RETURN_IF_ERROR(t, (t == NULL), "Failed to alloc TQS variable");
+
+ ret = rte_tqs_register_lcore(t, enabled_core_ids[0]);
+ TQS_RETURN_IF_ERROR(t, (ret == -EINVAL),
+ "lcore_id < RTE_TQS_MAX_LCORE");
+ rte_tqs_free(t);
+ return 0;
+}
+
+/*
+ * rte_tqs_unregister_lcore: Unregister threads
+ */
+static int
+test_tqs_unregister_lcore(void)
+{
+ struct rte_tqs *t;
+ const char *name = "TQS";
+ int i, ret;
+
+ printf("\nTest rte_tqs_unregister_lcore()\n");
+
+ t = rte_tqs_alloc(name, SOCKET_ID_ANY, 0);
+ TQS_RETURN_IF_ERROR(t, (t == NULL), "Failed to alloc TQS variable");
+
+ ret = rte_tqs_register_lcore(t, enabled_core_ids[0]);
+ TQS_RETURN_IF_ERROR(t, (ret == -EINVAL), "Register lcore failed");
+
+ /* Find first disabled core */
+ for (i = 0; i < RTE_TQS_MAX_LCORE; i++) {
+ if (enabled_core_ids[i] == 0)
+ break;
+ }
+ ret = rte_tqs_unregister_lcore(t, i);
+ TQS_RETURN_IF_ERROR(t, (ret != 0), "lcore disabled");
+
+ ret = rte_tqs_unregister_lcore(t, enabled_core_ids[0]);
+ TQS_RETURN_IF_ERROR(t, (ret == -EINVAL), "Valid lcore_id");
+
+ rte_tqs_free(t);
+ return 0;
+}
+
+/*
+ * rte_tqs_start: Trigger reader threads to count the number of times they enter
+ * the quiescent state
+ */
+static int
+test_tqs_start(void)
+{
+ struct rte_tqs *t;
+ const char *name = "TQS";
+ uint32_t token;
+ int i, ret;
+
+ printf("\nTest rte_tqs_start()\n");
+
+ t = rte_tqs_alloc(name, SOCKET_ID_ANY, 0);
+ TQS_RETURN_IF_ERROR(t, (t == NULL), "Failed to alloc TQS variable");
+
+ for (i = 0; i < 3; i++) {
+ ret = rte_tqs_register_lcore(t, enabled_core_ids[i]);
+ TQS_RETURN_IF_ERROR(t, (ret == -EINVAL),
+ "Register lcore failed");
+ }
+
+ ret = rte_tqs_start(t, 1, &token);
+ TQS_RETURN_IF_ERROR(t, (ret == -EINVAL), "1 quiescent states");
+
+ rte_tqs_free(t);
+ return 0;
+}
+
+/*
+ * rte_tqs_check: Check if all reader threads have completed entering the
+ * quiescent state 'n' times
+ */
+static int
+test_tqs_check(void)
+{
+
+ struct rte_tqs *t;
+ const char *name = "TQS";
+ int i, ret;
+ uint32_t token;
+
+ printf("\nTest rte_tqs_check()\n");
+
+ t = rte_tqs_alloc(name, SOCKET_ID_ANY, 0);
+ TQS_RETURN_IF_ERROR(t, (t == NULL), "Failed to alloc TQS variable");
+
+ ret = rte_tqs_start(t, 1, &token);
+ TQS_RETURN_IF_ERROR(t, (ret == -EINVAL), "TQS Start failed");
+
+ ret = rte_tqs_check(t, 0, true);
+ TQS_RETURN_IF_ERROR(t, (ret == -EINVAL), "Token = 0");
+
+ ret = rte_tqs_check(t, token, true);
+ TQS_RETURN_IF_ERROR(t, (ret == -EINVAL), "Blocking TQS check");
+
+ for (i = 0; i < 3; i++) {
+ ret = rte_tqs_register_lcore(t, enabled_core_ids[i]);
+ TQS_RETURN_IF_ERROR(t, (ret == -EINVAL),
+ "Register lcore failed");
+ }
+
+ ret = rte_tqs_check(t, token, false);
+ TQS_RETURN_IF_ERROR(t, (ret == -EINVAL), "Non-blocking TQS check");
+
+ rte_tqs_free(t);
+ return 0;
+}
+
+/*
+ * rte_tqs_lookup: Lookup a TQS variable by its name
+ */
+static int
+test_tqs_lookup(void)
+{
+ struct rte_tqs *t, *ret;
+ const char *name1 = "TQS";
+ const char *name2 = "NO_TQS";
+
+ printf("\nTest rte_tqs_lookup()\n");
+
+ t = rte_tqs_alloc(name1, SOCKET_ID_ANY, 0);
+ TQS_RETURN_IF_ERROR(t, (t == NULL), "Failed to alloc TQS variable");
+
+ ret = rte_tqs_lookup(name1);
+ TQS_RETURN_IF_ERROR(t, (ret != t), "Allocated TQS variable name");
+
+ ret = rte_tqs_lookup(name2);
+ TQS_RETURN_IF_ERROR(t, (ret != NULL), "Unallocated TQS variable name");
+
+ rte_tqs_free(t);
+ return 0;
+}
+
+/*
+ * rte_tqs_list_dump: Dump the status of all TQS variables to a file
+ */
+static int
+test_tqs_list_dump(void)
+{
+ struct rte_tqs *t1, *t2;
+ const char name1[] = "TQS_1";
+ const char name2[] = "TQS_2";
+ int i, ret;
+
+ printf("\nTest rte_tqs_list_dump()\n");
+
+ /* File pointer NULL */
+ rte_tqs_list_dump(NULL);
+
+ /* Dump an empty list */
+ rte_tqs_list_dump(stdout);
+
+ t1 = rte_tqs_alloc(name1, SOCKET_ID_ANY, 0);
+ TQS_RETURN_IF_ERROR(t1, (t1 == NULL), "Failed to alloc TQS variable");
+
+ /* Dump a list with TQS variable that has no readers */
+ rte_tqs_list_dump(stdout);
+
+ t2 = rte_tqs_alloc(name2, SOCKET_ID_ANY, 0);
+ TQS_RETURN_IF_ERROR(t2, (t2 == NULL), "Failed to alloc TQS variable");
+
+ ret = rte_tqs_register_lcore(t1, enabled_core_ids[0]);
+ if (ret != 0) {
+ printf("ERROR file %s, line %d: Failed to register lcore\n",
+ __FILE__, __LINE__);
+ return -1;
+ }
+
+ for (i = 1; i < 3; i++) {
+ ret = rte_tqs_register_lcore(t2, enabled_core_ids[i]);
+ if (ret != 0) {
+ printf("ERROR file %s, line %d: Failed to register lcore\n",
+ __FILE__, __LINE__);
+ return -1;
+ }
+ }
+
+ rte_tqs_list_dump(stdout);
+
+ rte_tqs_free(t1);
+ rte_tqs_free(t2);
+
+ return 0;
+}
+
+/*
+ * rte_tqs_dump: Dump status of a single TQS variable to a file
+ */
+static int
+test_tqs_dump(void)
+{
+ struct rte_tqs *t1, *t2;
+ const char name1[] = "TQS_1";
+ const char name2[] = "TQS_2";
+ int i, ret;
+
+ printf("\nTest rte_tqs_dump()\n");
+
+ /* NULL TQS variable */
+ rte_tqs_dump(stdout, NULL);
+
+ t1 = rte_tqs_alloc(name1, SOCKET_ID_ANY, 0);
+ TQS_RETURN_IF_ERROR(t1, (t1 == NULL), "Failed to alloc TQS variable");
+
+ /* NULL file pointer */
+ rte_tqs_dump(NULL, t1);
+
+ /* TQS variable with 0 core mask */
+ rte_tqs_dump(stdout, t1);
+
+ t2 = rte_tqs_alloc(name2, SOCKET_ID_ANY, 0);
+ TQS_RETURN_IF_ERROR(t2, (t2 == NULL), "Failed to alloc TQS variable");
+
+ ret = rte_tqs_register_lcore(t1, enabled_core_ids[0]);
+ if (ret != 0) {
+ printf("ERROR file %s, line %d: Failed to register lcore\n",
+ __FILE__, __LINE__);
+ return -1;
+ }
+
+ for (i = 1; i < 3; i++) {
+ ret = rte_tqs_register_lcore(t2, enabled_core_ids[i]);
+ if (ret != 0) {
+ printf("ERROR file %s, line %d: Failed to register lcore\n",
+ __FILE__, __LINE__);
+ return -1;
+ }
+ }
+
+ rte_tqs_dump(stdout, t1);
+ rte_tqs_dump(stdout, t2);
+
+ return 0;
+}
+
+struct rte_hash *handle;
+static uint32_t *keys;
+#define TOTAL_ENTRY (1 * 8)
+#define COUNTER_VALUE 4096
+uint32_t *hash_data[TOTAL_ENTRY];
+uint8_t writer_done;
+
+static int
+test_tqs_reader(__attribute__((unused)) void *arg)
+{
+ struct rte_tqs *t;
+ int i, ret;
+ uint32_t lcore_id = rte_lcore_id();
+
+ t = rte_tqs_lookup("TQS");
+ TQS_RETURN_IF_ERROR(t, (t == NULL), "TQS variable lookup failed");
+
+ do {
+ for (i = 0; i < TOTAL_ENTRY; i++) {
+ uint32_t *pdata;
+ ret = rte_hash_lookup_data(handle, keys+i,
+ (void **)&pdata);
+ if (ret != -ENOENT)
+ while (*pdata < COUNTER_VALUE)
+ ++*pdata;
+ }
+
+ /* Update quiescent state counter */
+ rte_tqs_update(t, lcore_id);
+ } while (!writer_done);
+
+ return 0;
+}
+
+static int
+init_hash(void)
+{
+ int i, ret;
+ struct rte_hash_parameters hash_params = {
+ .entries = TOTAL_ENTRY,
+ .key_len = sizeof(uint32_t),
+ .hash_func_init_val = 0,
+ .socket_id = rte_socket_id(),
+ .hash_func = rte_hash_crc,
+ .extra_flag =
+ RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF,
+ .name = "tests",
+ };
+
+ handle = rte_hash_create(&hash_params);
+ TQS_RETURN_IF_ERROR(NULL, (handle == NULL), "Hash create Failed");
+
+ for (i = 0; i < TOTAL_ENTRY; i++) {
+ hash_data[i] = rte_zmalloc(NULL, sizeof(uint32_t), 0);
+ TQS_RETURN_IF_ERROR(NULL, (hash_data[i] == NULL), "No memory");
+ }
+ keys = rte_malloc(NULL, sizeof(uint32_t) * TOTAL_ENTRY, 0);
+ TQS_RETURN_IF_ERROR(NULL, (keys == NULL), "No memory");
+
+ for (i = 0; i < TOTAL_ENTRY; i++)
+ keys[i] = i;
+
+ for (i = 0; i < TOTAL_ENTRY; i++) {
+ ret = rte_hash_add_key_data(handle, keys + i,
+ (void *)((uintptr_t)hash_data[i]));
+ TQS_RETURN_IF_ERROR(NULL, (ret < 0),
+ "Hash key add Failed #%d\n", i);
+ }
+ return 0;
+}
+
+/*
+ * Single writer, 1 TQS variable, 1 quiescent state
+ */
+static int
+test_tqs_sw_sv_1qs(void)
+{
+ struct rte_tqs *t;
+ const char *name = "TQS";
+ static uint32_t token;
+ int i, ret;
+ int32_t pos;
+ writer_done = 0;
+
+ printf("\nTest Single writer, 1 TQS variable, pass 1 quiescent state\n");
+
+ /* TQS variable is allocated */
+ t = rte_tqs_alloc(name, SOCKET_ID_ANY, 0);
+ TQS_RETURN_IF_ERROR(t, (t == NULL), "Failed to alloc TQS variable");
+
+ /* Register worker threads on 4 cores */
+ for (i = 0; i < 4; i++) {
+ ret = rte_tqs_register_lcore(t, enabled_core_ids[i]);
+ TQS_RETURN_IF_ERROR(t, (ret == -EINVAL),
+ "Register lcore failed");
+ }
+
+ /* Shared data structure created */
+ ret = init_hash();
+ TQS_RETURN_IF_ERROR(t, (ret != 0), "Hash init failed");
+
+ /* Reader threads are launched */
+ for (i = 0; i < 4; i++)
+ rte_eal_remote_launch(test_tqs_reader, NULL,
+ enabled_core_ids[i]);
+
+ for (i = 0; i < TOTAL_ENTRY; i++) {
+ /* Delete elements from the shared data structure */
+ pos = rte_hash_del_key(handle, keys + i);
+ TQS_RETURN_IF_ERROR(t, (pos < 0), "Delete key failed #%d",
+ keys[i]);
+ /* Start the quiescent state query process */
+ ret = rte_tqs_start(t, num_1qs, &token);
+ TQS_RETURN_IF_ERROR(t, (ret != 0), "TQS Start Failed");
+
+ /* Check the quiescent state status */
+ rte_tqs_check(t, token, true);
+ TQS_RETURN_IF_ERROR(t, (*hash_data[i] != COUNTER_VALUE),
+ "Reader did not complete %d",
+ *hash_data[i]);
+
+ ret = rte_hash_free_key_with_position(handle, pos);
+ TQS_RETURN_IF_ERROR(t, (ret < 0),
+ "Failed to free the key #%d", keys[i]);
+ rte_free(hash_data[i]);
+ }
+ writer_done = 1;
+ /* Wait until all readers have exited */
+ rte_eal_mp_wait_lcore();
+
+ /* Free TQS variable */
+ rte_tqs_free(t);
+ rte_hash_free(handle);
+ rte_free(keys);
+
+ return 0;
+}
+
+static int
+test_tqs_main(void)
+{
+ if (get_enabled_cores_mask() != 0)
+ return -1;
+
+ /* Error-checking test cases */
+ if (test_tqs_alloc() < 0)
+ goto test_fail;
+
+ if (test_tqs_register_lcore() < 0)
+ goto test_fail;
+
+ if (test_tqs_unregister_lcore() < 0)
+ goto test_fail;
+
+ if (test_tqs_start() < 0)
+ goto test_fail;
+
+ if (test_tqs_check() < 0)
+ goto test_fail;
+
+ if (test_tqs_lookup() < 0)
+ goto test_fail;
+
+ if (test_tqs_list_dump() < 0)
+ goto test_fail;
+
+ if (test_tqs_dump() < 0)
+ goto test_fail;
+
+
+ /* Functional test cases */
+ if (test_tqs_sw_sv_1qs() < 0)
+ goto test_fail;
+
+ printf("\n");
+ return 0;
+
+ test_fail:
+ return -1;
+}
+
+REGISTER_TEST_COMMAND(tqs_autotest, test_tqs_main);
--
2.17.1
Ilya Maximets
2018-11-22 07:31:08 UTC
Hi.
Are there any differentiation points from liburcu [1]?
Is there any benefit to having our own implementation inside DPDK?

[1] http://liburcu.org/
https://lwn.net/Articles/573424/

Best regards, Ilya Maximets.
Post by Honnappa Nagarahalli
Lock-less data structures provide scalability and determinism.
They enable use cases where locking may not be allowed
(for ex: real-time applications).
In the following paras, the term 'memory' refers to memory allocated
by typical APIs like malloc or anything that is representative of
memory, for ex: an index of a free element array.
Since these data structures are lock less, the writers and readers
are accessing the data structures simultaneously. Hence, while removing
an element from a data structure, the writers cannot return the memory
to the allocator, without knowing that the readers are not
referencing that element/memory anymore. Hence, it is required to
Delete: in this step, the writer removes the element from the
data structure but does not return the associated memory to the allocator.
This will ensure that new readers will not get a reference to the removed
element. Removing the reference is an atomic operation.
Free: in this step, the writer returns the memory to the
memory allocator, only after knowing that all the readers have stopped
referencing the removed element.
This library helps the writer determine when it is safe to free the
memory.
This library makes use of Thread Quiescent State (TQS). TQS can be
defined as 'any point in the thread execution where the thread does
not hold a reference to shared memory'. It is up to the application to
Time -------------------------------------------------->
| |
RT1 $++++****D1****+++***D2*|**+++|+++**D3*****++++$
| |
RT2 $++++****D1****++|+**D2|***++++++**D3*****++++$
| |
RT3 $++++****D1****+++***|D2***|++++++**D2*****++++$
| |
|<--->|
Del | Free
|
Cannot free memory
during this period
RTx - Reader thread
< and > - Start and end of while(1) loop
***Dx*** - Reader thread is accessing the shared data structure Dx.
i.e. critical section.
+++ - Reader thread is not accessing any shared data structure.
i.e. non critical section or quiescent state.
Del - Point in time when the reference to the entry is removed using
atomic operation.
Free - Point in time when the writer can free the entry.
As shown, thread RT1 accesses data structures D1, D2 and D3. When it is
accessing D2, if the writer has to remove an element from D2, the
writer cannot return the memory associated with that element to the
allocator. The writer can return the memory to the allocator only after
the reader stops referencing D2. In other words, reader thread RT1
has to enter a quiescent state.
Similarly, since thread RT3 is also accessing D2, writer has to wait till
RT3 enters quiescent state as well.
However, the writer does not need to wait for RT2 to enter quiescent state.
Thread RT2 was not accessing D2 when the delete operation happened.
So, RT2 will not get a reference to the deleted entry.
It can be noted that the critical sections for D2 and D3 are quiescent states
for D1; i.e. for a given data structure Dx, any point in the thread execution
that does not reference Dx is a quiescent state.
For DPDK applications, the start and end of while(1) loop (where no shared
data structures are getting accessed) act as perfect quiescent states. This
will combine all the shared data structure accesses into a single critical
section and keeps the overhead introduced by this library to a minimum.
However, the time taken to identify the end of a critical section is
proportional to its length and to the number of reader threads.
So, if the application desires, it should be possible to identify the end
of the critical section for each data structure.
To provide the required flexibility, this library has a concept of TQS
variable. The application can create one or more TQS variables to help it
track the end of one or more critical sections.
The application can create a TQS variable using the API rte_tqs_alloc.
It takes a mask of lcore IDs that will report their quiescent states
using this variable. This mask can be empty to start with.
rte_tqs_register_lcore API will register a reader thread to report its
quiescent state. This can be called from any control plane thread or from
the reader thread. The application can create a TQS variable with no reader
threads and add the threads dynamically using this API.
The application can trigger the reader threads to report their quiescent
state status by calling the API rte_tqs_start. It is possible for multiple
writer threads to query the quiescent state status simultaneously. Hence,
rte_tqs_start returns a token to each caller.
The application has to call rte_tqs_check API with the token to get the
current status. Option to block till all the threads enter the quiescent
state is provided. If this API indicates that all the threads have entered
the quiescent state, the application can free the deleted entry.
The separation of triggering the reporting from querying the status provides
the writer threads flexibility to do useful work instead of waiting for the
reader threads to enter the quiescent state.
rte_tqs_unregister_lcore API will remove a reader thread from reporting its
quiescent state using a TQS variable. The rte_tqs_check API will not wait
for this reader thread to report the quiescent state status anymore.
Finally, a TQS variable can be deleted by calling rte_tqs_free API.
Application must make sure that the reader threads are not referencing the
TQS variable anymore before deleting it.
The reader threads should call the rte_tqs_update API to indicate that they
have entered a quiescent state. This API checks whether a writer has triggered a
quiescent state query and updates the state accordingly.
1) Add more test cases
2) Convert to patch
3) Incorporate feedback from community
4) Add documentation
test/tqs: Add API and functional tests
log: add TQS log type
tqs: add thread quiescent state library
config/common_base | 6 +
lib/Makefile | 2 +
lib/librte_eal/common/include/rte_log.h | 1 +
lib/librte_tqs/Makefile | 23 +
lib/librte_tqs/meson.build | 5 +
lib/librte_tqs/rte_tqs.c | 249 +++++++++++
lib/librte_tqs/rte_tqs.h | 352 +++++++++++++++
lib/librte_tqs/rte_tqs_version.map | 16 +
lib/meson.build | 2 +-
mk/rte.app.mk | 1 +
test/test/Makefile | 2 +
test/test/autotest_data.py | 6 +
test/test/meson.build | 5 +-
test/test/test_tqs.c | 540 ++++++++++++++++++++++++
14 files changed, 1208 insertions(+), 2 deletions(-)
create mode 100644 lib/librte_tqs/Makefile
create mode 100644 lib/librte_tqs/meson.build
create mode 100644 lib/librte_tqs/rte_tqs.c
create mode 100644 lib/librte_tqs/rte_tqs.h
create mode 100644 lib/librte_tqs/rte_tqs_version.map
create mode 100644 test/test/test_tqs.c
--
2.17.1
Stephen Hemminger
2018-11-27 22:28:03 UTC
On Wed, 21 Nov 2018 21:30:52 -0600
Post by Honnappa Nagarahalli
Lock-less data structures provide scalability and determinism.
They enable use cases where locking may not be allowed
(for ex: real-time applications).
In the following paras, the term 'memory' refers to memory allocated
by typical APIs like malloc or anything that is representative of
memory, for ex: an index of a free element array.
Since these data structures are lock less, the writers and readers
are accessing the data structures simultaneously. Hence, while removing
an element from a data structure, the writers cannot return the memory
to the allocator, without knowing that the readers are not
referencing that element/memory anymore. Hence, it is required to
Delete: in this step, the writer removes the element from the
data structure but does not return the associated memory to the allocator.
This will ensure that new readers will not get a reference to the removed
element. Removing the reference is an atomic operation.
Free: in this step, the writer returns the memory to the
memory allocator, only after knowing that all the readers have stopped
referencing the removed element.
This library helps the writer determine when it is safe to free the
memory.
This library makes use of Thread Quiescent State (TQS). TQS can be
defined as 'any point in the thread execution where the thread does
not hold a reference to shared memory'. It is up to the application to
Time -------------------------------------------------->
| |
RT1 $++++****D1****+++***D2*|**+++|+++**D3*****++++$
| |
RT2 $++++****D1****++|+**D2|***++++++**D3*****++++$
| |
RT3 $++++****D1****+++***|D2***|++++++**D2*****++++$
| |
|<--->|
Del | Free
|
Cannot free memory
during this period
RTx - Reader thread
< and > - Start and end of while(1) loop
***Dx*** - Reader thread is accessing the shared data structure Dx.
i.e. critical section.
+++ - Reader thread is not accessing any shared data structure.
i.e. non critical section or quiescent state.
Del - Point in time when the reference to the entry is removed using
atomic operation.
Free - Point in time when the writer can free the entry.
As shown, thread RT1 accesses data structures D1, D2 and D3. When it is
accessing D2, if the writer has to remove an element from D2, the
writer cannot return the memory associated with that element to the
allocator. The writer can return the memory to the allocator only after
the reader stops referencing D2. In other words, reader thread RT1
has to enter a quiescent state.
Similarly, since thread RT3 is also accessing D2, the writer has to wait
until RT3 enters a quiescent state as well.
However, the writer does not need to wait for RT2 to enter quiescent state.
Thread RT2 was not accessing D2 when the delete operation happened.
So, RT2 will not get a reference to the deleted entry.
Note that the critical sections for D2 and D3 are quiescent states
for D1. That is, for a given data structure Dx, any point in the thread
execution that does not reference Dx is a quiescent state.
For DPDK applications, the start and end of the while(1) loop (where no shared
data structures are being accessed) act as perfect quiescent states. This
combines all the shared data structure accesses into a single critical
section and keeps the overhead introduced by this library to a minimum.
However, the time taken to identify the end of a critical section is
proportional to the length of the critical section and the number of reader
threads. So, if the application desires, it should be possible to identify
the end of the critical section for each data structure.
To provide the required flexibility, this library has a concept of TQS
variable. The application can create one or more TQS variables to help it
track the end of one or more critical sections.
The application can create a TQS variable using the API rte_tqs_alloc.
It takes a mask of lcore IDs that will report their quiescent states
using this variable. This mask can be empty to start with.
The rte_tqs_register_lcore API registers a reader thread to report its
quiescent state. It can be called from any control plane thread or from
the reader thread itself. The application can create a TQS variable with no
reader threads and add the threads dynamically using this API.
The application can trigger the reader threads to report their quiescent
state status by calling the API rte_tqs_start. It is possible for multiple
writer threads to query the quiescent state status simultaneously. Hence,
rte_tqs_start returns a token to each caller.
The application has to call the rte_tqs_check API with the token to get the
current status. An option to block until all the threads enter the quiescent
state is provided. If this API indicates that all the threads have entered
the quiescent state, the application can free the deleted entry.
The separation of triggering the reporting from querying the status provides
the writer threads flexibility to do useful work instead of waiting for the
reader threads to enter the quiescent state.
The rte_tqs_unregister_lcore API removes a reader thread from reporting its
quiescent state via a TQS variable. The rte_tqs_check API will no longer wait
for this reader thread to report its quiescent state status.
Finally, a TQS variable can be deleted by calling the rte_tqs_free API. The
application must make sure that the reader threads are no longer referencing
the TQS variable before deleting it.
The reader threads should call the rte_tqs_update API to indicate that they
have entered a quiescent state. This API checks whether a writer has triggered
a quiescent state query and updates the state accordingly.
1) Add more test cases
2) Convert to patch
3) Incorporate feedback from community
4) Add documentation
test/tqs: Add API and functional tests
log: add TQS log type
tqs: add thread quiescent state library
config/common_base | 6 +
lib/Makefile | 2 +
lib/librte_eal/common/include/rte_log.h | 1 +
lib/librte_tqs/Makefile | 23 +
lib/librte_tqs/meson.build | 5 +
lib/librte_tqs/rte_tqs.c | 249 +++++++++++
lib/librte_tqs/rte_tqs.h | 352 +++++++++++++++
lib/librte_tqs/rte_tqs_version.map | 16 +
lib/meson.build | 2 +-
mk/rte.app.mk | 1 +
test/test/Makefile | 2 +
test/test/autotest_data.py | 6 +
test/test/meson.build | 5 +-
test/test/test_tqs.c | 540 ++++++++++++++++++++++++
14 files changed, 1208 insertions(+), 2 deletions(-)
create mode 100644 lib/librte_tqs/Makefile
create mode 100644 lib/librte_tqs/meson.build
create mode 100644 lib/librte_tqs/rte_tqs.c
create mode 100644 lib/librte_tqs/rte_tqs.h
create mode 100644 lib/librte_tqs/rte_tqs_version.map
create mode 100644 test/test/test_tqs.c
Mixed feelings about this one.

Love to see RCU used for more things since it is much better than reader/writer
locks for many applications. But hate to see DPDK reinventing every other library
and not reusing code. Userspace RCU https://liburcu.org/ is widely supported by
distros, more thoroughly tested and documented, and more flexible.

The issue with many of these reinventions is a tradeoff of DPDK growing
another dependency on an external library versus using common code.

For RCU, the big issue for me is the testing and documentation of how to do RCU
safely. Many people get it wrong!
Van Haaren, Harry
2018-11-27 22:49:21 UTC
Permalink
-----Original Message-----
Sent: Tuesday, November 27, 2018 2:28 PM
Subject: Re: [dpdk-dev] [RFC 0/3] tqs: add thread quiescent state library
On Wed, 21 Nov 2018 21:30:52 -0600
Post by Honnappa Nagarahalli
Lock-less data structures provide scalability and determinism.
They enable use cases where locking may not be allowed
(for ex: real-time applications).
<snip detailed RFC commit messag>
Post by Honnappa Nagarahalli
<snip patch list and diffstat>
Mixed feelings about this one.
Love to see RCU used for more things since it is much better than reader/writer
locks for many applications. But hate to see DPDK reinventing every other library
and not reusing code. Userspace RCU https://liburcu.org/ is widely supported by
distros, more thoroughly tested and documented, and more flexible.
The issue with many of these reinventions is a tradeoff of DPDK growing
another dependency on external library versus using common code.
For RCU, the big issue for me is the testing and documentation of how to do RCU
safely. Many people get it wrong!
Some notes on liburcu (and my amateur understanding of LGPL; I'm not a license lawyer :)

Liburcu is LGPL, which AFAIK means we must dynamically link applications if the application code is BSD or another permissive license.

The side effect of this is that urcu function calls must be "real" function calls and inlining them is not possible. Therefore using liburcu in LGPL mode could have a performance impact in this case. I expect estimating the performance cost would be difficult, as it's pretty much case-by-case depending on what you're doing in the surrounding code.

Generally I'm in favour of using established libraries (particularly for complex functionality like RCU) but in this case I think there's a tradeoff with raw performance.
Honnappa Nagarahalli
2018-11-28 05:31:56 UTC
Permalink
Post by Van Haaren, Harry
Post by Stephen Hemminger
On Wed, 21 Nov 2018 21:30:52 -0600
Post by Honnappa Nagarahalli
Lock-less data structures provide scalability and determinism.
real-time applications).
<snip detailed RFC commit messag>
Post by Stephen Hemminger
Post by Honnappa Nagarahalli
<snip patch list and diffstat>
Mixed feelings about this one.
Love to see RCU used for more things since it is much better than
reader/writer locks for many applications. But hate to see DPDK
reinventing every other library and not reusing code. Userspace RCU
https://liburcu.org/ is widely supported by distro's, more throughly
tested and documented, and more flexiple.
The issue with many of these reinventions is a tradeoff of DPDK
growing another dependency on external library versus using common code.
Agree with the dependency issues. Sometimes flexibility also causes confusion, and brings features that are not necessarily required for a targeted use case. I have seen that much of the functionality that could be left to the application is implemented as part of the library.
I think having it in DPDK will give us control over the amount of capability this library will have, and freedom over the changes we would like to make to such a library. I also view DPDK as one package where all the things required for data plane development are available.
Post by Van Haaren, Harry
Post by Stephen Hemminger
For RCU, the big issue for me is the testing and documentation of how
to do RCU safely. Many people get it wrong!
Hopefully, we all will do a better job collectively :)
Post by Van Haaren, Harry
Some notes on liburcu (and my amateur understanding of LGPL, I'm not a license lawyer :)
Liburcu is LGPL, which AFAIK means we must dynamically link applications if
the application code is BSD or other permissive licenses.
The side effect of this is that urcu function calls must be "real" function calls
and inlining them is not possible. Therefore using liburcu in LGPL mode could
have a performance impact in this case. I expect estimating the performance
cost would be
difficult as its pretty much a case-by-case depending on what you're doing in
the surrounding code.
Generally I'm in favour of using established libraries (particularly for complex
functionality like RCU) but in this case I think there's a tradeoff with raw
performance.
Stephen Hemminger
2018-11-28 23:23:51 UTC
Permalink
On Wed, 28 Nov 2018 05:31:56 +0000
Post by Honnappa Nagarahalli
Post by Stephen Hemminger
<snip earlier discussion>
Agree with the dependency issues. Sometimes flexibility also causes confusion and features that are not necessarily required for a targeted use case. I have seen that much of the functionality that can be left to the application is implemented as part of the library.
I think having it in DPDK will give us control over the amount of capability this library will have and freedom over changes we would like to make to such a library. I also view DPDK as one package where all things required for data plane development are available.
Post by Stephen Hemminger
For RCU, the big issue for me is the testing and documentation of how
to do RCU safely. Many people get it wrong!
Hopefully, we all will do a better job collectively :)
Reinventing RCU is not helping anyone.

DPDK needs to fix its dependency model, and just admit that it is ok
to build off of more than glibc.

Having used liburcu, it can be done in a small manner and really isn't that
confusing.

Is your real issue the LGPL license of liburcu for your skittish customers?
Honnappa Nagarahalli
2018-11-30 02:13:32 UTC
Permalink
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
Post by Stephen Hemminger
<snip earlier discussion>
Reinventing RCU is not helping anyone.
IMO, this depends on what rte_tqs has to offer and what the requirements are. Before starting this patch, I looked at the liburcu APIs. I have to say, fairly quickly (no offense) I concluded that they do not address DPDK's needs. I took a deeper look at the APIs/code in the past day and reached the same conclusion. My partial analysis (more APIs can be analyzed; I do not have the cycles at this point) is as follows:

The reader threads' information is maintained in a linked list[1]. This linked list is protected by a mutex lock[2]. Any additions/deletions/traversals of this list are blocking and cannot happen in parallel.

The API 'synchronize_rcu' [3] (similar in functionality to the rte_tqs_check call) is a blocking call. There is no option to make it non-blocking. The writer spins cycles while waiting for the grace period to end.

'synchronize_rcu' also takes a grace-period lock [4]. If I have multiple writers running on data plane threads, I cannot call this API to reclaim memory in the worker threads, as it will block other worker threads. This means an extra thread is required (on the control plane?) to do garbage collection, along with a method to push pointers from the worker threads to the garbage collection thread. It also means the time from delete to free increases, putting pressure on the amount of memory held up. Since this API cannot be called concurrently by multiple writers, each writer has to wait for the other writers' grace periods to end (i.e. multiple writer threads cannot overlap their grace periods).

This API also has to traverse the linked list, which is not well suited to being called on the data plane.

I have not gone too deeply into the rcu_thread_offline[5] API. This again needs to be used on worker cores and does not look very optimal.

I have glanced at rcu_quiescent_state [6]; it wakes up the thread calling 'synchronize_rcu', which seems like a good amount of code for the data plane.

[1] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-qsbr.h#L85
[2] https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c#L68
[3] https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c#L344
[4] https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c#L58
[5] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-qsbr.h#L211
[6] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-qsbr.h#L193

Coming to what is provided in rte_tqs:

The synchronize_rcu functionality is split into two APIs: rte_tqs_start and rte_tqs_check. The reader data is maintained as an array.

Both APIs are lock-free, allowing them to be called from multiple threads concurrently. This allows multiple writers to wait for their grace periods concurrently, as well as to overlap their grace periods. The rte_tqs_start API returns a token which provides the ability to separate the quiescent state waiting of different writers. Hence, no writer waits for another writer's grace period to end. Since these two APIs are lock-free, they can be called from writers running on worker cores as well, without the need for a separate thread to do garbage collection.

The separation into two APIs gives writers the ability not to spin cycles waiting for the grace period to end. This enables different ways of doing garbage collection. For example: a data structure delete API could remove the entry from the data structure, call rte_tqs_start, and return to the caller. On the next API call into the library, the API can call rte_tqs_check (which will mostly indicate that the grace period is complete) and free the previously deleted entry.

rte_tqs_update (mapping to rcu_quiescent_state) is pretty small and simple.

The rte_tqs_register and rte_tqs_unregister APIs are lock-free. Hence, additional APIs like rcu_thread_online and rcu_thread_offline are not required. The rte_tqs_unregister API (compared to rcu_thread_offline) is much simpler and better suited for use in worker threads.
Post by Honnappa Nagarahalli
DPDK needs to fix its dependency model, and just admit that it is ok to build off
of more than glibc.
Having used liburcu, it can be done in a small manner and really isn't that
confusing.
Is your real issue the LGPL license of liburcu for your skittish customers?
I have not had any discussions on this. Customers are mainly focused on having a solution over which they have meaningful control. They want to be able to submit a patch and change things if required. For ex: barriers for Arm [7] are not optimal. How easy is it to change this and get it into distros (there are both internal and external factors here)?

[7] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/arch/arm.h#L44
Luca Boccassi
2018-11-30 16:26:25 UTC
Permalink
Post by Honnappa Nagarahalli
<snip earlier discussion and analysis of liburcu vs the proposed rte_tqs APIs>
liburcu has many flavours already, qsbr being one of them. If none of
those are optimal for this use case, why not work with upstream to
either improve the existing flavours, or add a new one?

You have the specific knowledge and expertise about the requirements
and needs of the implementation for this use case, and they have the long-
time and extensive experience of maintaining the library on a wide range
of systems and use cases. Why not combine both?
I might be wrong, but to me, nothing described above seems to be
particularly or uniquely tied to implementing a software dataplane or
to DPDK. IMHO we should pool and share wherever possible, rather than
build an ecosystem closed onto itself.

Just my 2c of course!
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
DPDK needs to fix its dependency model, and just admit that it is ok to build off
of more than glibc.
Having used liburcu, it can be done in a small manner and really isn't that
confusing.
Is your real issue the LGPL license of liburcu for your skittish customers?
I have not had any discussions on this. Customers are mainly focused
on having a solution over which they have meaningful control. They want
to be able to submit a patch and change things if required. For ex:
barriers for Arm [7] are not optimal. How easy is it to change this
and get it into distros (there are both internal and external factors
here)?
It's just as easy (or as hard) as it is with DPDK. So it's either wait
for the distros to update, or rebuild locally.
--
Kind regards,
Luca Boccassi
Stephen Hemminger
2018-11-30 18:32:07 UTC
Permalink
On Fri, 30 Nov 2018 16:26:25 +0000
Post by Luca Boccassi
<snip earlier discussion and analysis>
liburcu has many flavours already, qsbr being one of them. If none of
those are optimal for this use case, why not work with upstream to
either improve the existing flavours, or add a new one?
You have the specific knowledge and expertise about the requirements
needs for the implementation for this use case, and they have the long-
time and extensive experience maintaining the library on a wide range
of systems and use cases. Why not combine both?
I might be wrong, but to me, nothing described above seems to be
particularly or uniquely tied to implementing a software dataplane or
to DPDK. IMHO we should pool and share wherever possible, rather than
build an ecosystem closed onto itself.
Just my 2c of course!
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
DPDK needs to fix its dependency model, and just admit that it is ok to build off
of more than glibc.
Having used liburcu, it can be done in a small manner and really isn't that
confusing.
Is your real issue the LGPL license of liburcu for your skittish customers?
I have not had any discussions on this. Customers are mainly focused
on having a solution over which they have meaningful control. They want
the ability to make changes themselves; for ex, the barriers for Arm [7]
are not optimal. How easy is it to change this
and get it into distros (there are both internal and external factors
here)?
It's just as easy (or as hard) as it is with DPDK. So it's either wait
for the distros to update, or rebuild locally.
Either way it would be useful to have the broader RCU community
in on the discussion.
Honnappa Nagarahalli
2018-11-30 20:20:34 UTC
Hi Luca,
Appreciate your comments.
liburcu has many flavours already, qsbr being one of them. If none of those
are optimal for this use case, why not work with upstream to either improve
the existing flavours, or add a new one?
You have the specific knowledge and expertise about the requirements
of the implementation for this use case, and they have the long-time
and extensive experience maintaining the library on a wide range of
systems and use cases. Why not combine both?
I might be wrong, but to me, nothing described above seems to be
particularly or uniquely tied to implementing a software dataplane or to
DPDK.
This comment does not help much; I would prefer a more concrete one. If possible, can you please mention where else these features are required? IMO, if these were required for other use cases, they would have been present in liburcu already.

IMHO we should pool and share wherever possible, rather than build
an ecosystem closed onto itself.
Just my 2c of course!
I would say this is true for some of the other libraries in DPDK as well [1], or in fact for the whole of DPDK. The Linux kernel has introduced XDP. I would say the Linux kernel has a much more vibrant history and community. If XDP lacks some of the features we want, why not work with the Linux community to improve it and not do DPDK?

[1] https://github.com/urcu/userspace-rcu/blob/master/doc/uatomic-api.md

IMO, we should focus our discussion on how relevant rte_tqs is to DPDK, whether it solves the problems people face while using DPDK and the value add it brings. This should be the basis for the acceptance rather than what is available elsewhere.
Post by Honnappa Nagarahalli
Post by Stephen Hemminger
DPDK needs to fix its dependency model, and just admit that it is ok
to build off of more than glibc.
Having used liburcu, it can be done in a small manner and really
isn't that confusing.
Is your real issue the LGPL license of liburcu for your skittish customers?
I have not had any discussions on this. Customers are mainly focused
on having a solution over which they have meaningful control. They want
the ability to make changes themselves; for ex, the barriers for Arm [7]
are not optimal. How easy is it to change this
and get it into distros (there are both internal and external factors
here)?
It's just as easy (or as hard) as it is with DPDK. So it's either wait for the
distros to update, or rebuild locally.
I have not worked with that community, hence I cannot comment on that. Apologies for asking some hard questions, but, do you know how easy it is to change the license for liburcu? Can the DPDK governing board take a decision to change the license?
Mattias Rönnblom
2018-11-30 20:56:30 UTC
Post by Honnappa Nagarahalli
Post by Stephen Hemminger
Reinventing RCU is not helping anyone.
The reader threads' information is maintained in a linked list[1]. This linked list is protected by a mutex lock[2]. Any additions/deletions/traversals of this list are blocking and cannot happen in parallel.
The API, 'synchronize_rcu' [3] (similar functionality to rte_tqs_check call) is a blocking call. There is no option provided to make it non-blocking. The writer spins cycles while waiting for the grace period to get over.
Wouldn't the options be call_rcu, which rarely blocks, or defer_rcu()
which never? Why would the average application want to wait for the
grace period to be over anyway?
Post by Honnappa Nagarahalli
'synchronize_rcu' also has grace period lock [4]. If I have multiple writers running on data plane threads, I cannot call this API to reclaim the memory in the worker threads as it will block other worker threads. This means, there is an extra thread required (on the control plane?) which does garbage collection and a method to push the pointers from worker threads to the garbage collection thread. This also means the time duration from delete to free increases putting pressure on amount of memory held up.
Since this API cannot be called concurrently by multiple writers, each writer has to wait for other writer's grace period to get over (i.e. multiple writer threads cannot overlap their grace periods).
"Real" DPDK applications typically have to interact with the outside
world using interfaces beyond DPDK packet I/O, and this is best done via
an intermediate "control plane" thread running in the DPDK application.
Typically, this thread would also be the RCU writer and "garbage
collector", I would say.
Post by Honnappa Nagarahalli
This API also has to traverse the linked list, which is not very well suited for calling on the data plane.
I have not gone too much into the rcu_thread_offline[5] API. This again needs to be used on worker cores and does not look to be very optimal.
I have glanced at rcu_quiescent_state [6]; it wakes up the thread calling 'synchronize_rcu', which seems like a good amount of code for the data plane.
Wouldn't the typical DPDK lcore worker call rcu_quiescent_state() after
processing a burst of packets? If so, I would more lean toward
"negligible overhead", than "a good amount of code".

I must admit I didn't look at your library in detail, but I must still
ask: if TQS is basically RCU, why isn't it called RCU? And why isn't the
API calls named in a similar manner?
Stephen Hemminger
2018-11-30 23:44:17 UTC
On Fri, 30 Nov 2018 21:56:30 +0100
Post by Mattias Rönnblom
Post by Honnappa Nagarahalli
Post by Stephen Hemminger
Reinventing RCU is not helping anyone.
The reader threads' information is maintained in a linked list[1]. This linked list is protected by a mutex lock[2]. Any additions/deletions/traversals of this list are blocking and cannot happen in parallel.
The API, 'synchronize_rcu' [3] (similar functionality to rte_tqs_check call) is a blocking call. There is no option provided to make it non-blocking. The writer spins cycles while waiting for the grace period to get over.
Wouldn't the options be call_rcu, which rarely blocks, or defer_rcu()
which never? Why would the average application want to wait for the
grace period to be over anyway?
Post by Honnappa Nagarahalli
'synchronize_rcu' also has grace period lock [4]. If I have multiple writers running on data plane threads, I cannot call this API to reclaim the memory in the worker threads as it will block other worker threads. This means, there is an extra thread required (on the control plane?) which does garbage collection and a method to push the pointers from worker threads to the garbage collection thread. This also means the time duration from delete to free increases putting pressure on amount of memory held up.
Since this API cannot be called concurrently by multiple writers, each writer has to wait for other writer's grace period to get over (i.e. multiple writer threads cannot overlap their grace periods).
"Real" DPDK applications typically have to interact with the outside
world using interfaces beyond DPDK packet I/O, and this is best done via
an intermediate "control plane" thread running in the DPDK application.
Typically, this thread would also be the RCU writer and "garbage
collector", I would say.
Post by Honnappa Nagarahalli
This API also has to traverse the linked list, which is not very well suited for calling on the data plane.
I have not gone too much into the rcu_thread_offline[5] API. This again needs to be used on worker cores and does not look to be very optimal.
I have glanced at rcu_quiescent_state [6]; it wakes up the thread calling 'synchronize_rcu', which seems like a good amount of code for the data plane.
Wouldn't the typical DPDK lcore worker call rcu_quiescent_state() after
processing a burst of packets? If so, I would more lean toward
"negligible overhead", than "a good amount of code".
I must admit I didn't look at your library in detail, but I must still
ask: if TQS is basically RCU, why isn't it called RCU? And why isn't the
API calls named in a similar manner?
We used liburcu at Brocade with DPDK. It was just a case of putting rcu_quiescent_state in the packet handling
loop. There were a bunch more cases where control thread needed to register/unregister as part of RCU.
I think any library would have that issue with user supplied threads. You need a "worry about me" and
a "don't worry about me" API in the library.

There is also a tradeoff between call_rcu and defer_rcu about what context the RCU callback happens.
You really need a control thread to handle the RCU cleanup.

The point is that RCU steps into the application design, and liburcu seems to be flexible enough
and well documented enough to allow for more options.
Honnappa Nagarahalli
2018-12-01 18:37:02 UTC
Post by Stephen Hemminger
On Fri, 30 Nov 2018 21:56:30 +0100
Post by Mattias Rönnblom
Post by Honnappa Nagarahalli
Post by Stephen Hemminger
Reinventing RCU is not helping anyone.
IMO, this depends on what the rte_tqs has to offer and what the
requirements are. Before starting this patch, I looked at the liburcu APIs. I
have to say, fairly quickly (no offense) I concluded that this does not address
DPDK's needs. I took a deeper look at the APIs/code in the past day and I still
concluded the same. My partial analysis (analysis of more APIs can be done, I
Post by Mattias Rönnblom
Post by Honnappa Nagarahalli
The reader threads' information is maintained in a linked list[1]. This
linked list is protected by a mutex lock[2]. Any additions/deletions/traversals
of this list are blocking and cannot happen in parallel.
Post by Mattias Rönnblom
Post by Honnappa Nagarahalli
The API, 'synchronize_rcu' [3] (similar functionality to rte_tqs_check call)
is a blocking call. There is no option provided to make it non-blocking. The
writer spins cycles while waiting for the grace period to get over.
Post by Mattias Rönnblom
Wouldn't the options be call_rcu, which rarely blocks, or defer_rcu()
which never?
call_rcu (I do not know about defer_rcu; have you looked at the implementation to verify your claim?) requires a separate thread that does garbage collection (this forces a programming model; the thread is even launched by the library). call_rcu() allows you to batch and defer the work to the garbage collector thread. In the garbage collector thread, when 'synchronize_rcu' is called, it spins for at least 1 grace period. Deferring and batching also have the side effect that memory is held up for a longer time.

Why would the average application want to wait for the
Post by Stephen Hemminger
Post by Mattias Rönnblom
grace period to be over anyway?
I assume when you say 'average application', you mean the writer(s) are on control plane.
It has been agreed (in the context of rte_hash) that writer(s) can be on the data plane. In this case, 'synchronize_rcu' cannot be called from the data plane. If call_rcu has to be called, it adds additional cycles on the data plane to push the pointers (or any data) to the garbage collector thread. I kindly suggest you take a look at the liburcu code and the rte_tqs code for more details.
Additionally, call_rcu function is more than 10 lines.
Post by Stephen Hemminger
Post by Mattias Rönnblom
Post by Honnappa Nagarahalli
'synchronize_rcu' also has grace period lock [4]. If I have multiple writers
running on data plane threads, I cannot call this API to reclaim the memory in
the worker threads as it will block other worker threads. This means, there is
an extra thread required (on the control plane?) which does garbage
collection and a method to push the pointers from worker threads to the
garbage collection thread. This also means the time duration from delete to
free increases putting pressure on amount of memory held up.
Post by Mattias Rönnblom
Post by Honnappa Nagarahalli
Since this API cannot be called concurrently by multiple writers, each
writer has to wait for other writer's grace period to get over (i.e. multiple
writer threads cannot overlap their grace periods).
Post by Mattias Rönnblom
"Real" DPDK applications typically have to interact with the outside
world using interfaces beyond DPDK packet I/O, and this is best done
via an intermediate "control plane" thread running in the DPDK application.
Typically, this thread would also be the RCU writer and "garbage
collector", I would say.
Agree, that is one way to do it and it comes with its own issues as I described above.
Post by Stephen Hemminger
Post by Mattias Rönnblom
Post by Honnappa Nagarahalli
This API also has to traverse the linked list which is not very well suited for
calling on data plane.
Post by Mattias Rönnblom
Post by Honnappa Nagarahalli
I have not gone too much into rcu_thread_offline[5] API. This again needs
to be used in worker cores and does not look to be very optimal.
Post by Mattias Rönnblom
Post by Honnappa Nagarahalli
I have glanced at rcu_quiescent_state [6], it wakes up the thread calling
'synchronize_rcu' which seems good amount of code for the data plane.
Post by Mattias Rönnblom
Wouldn't the typical DPDK lcore worker call rcu_quiescent_state()
after processing a burst of packets? If so, I would more lean toward
"negligible overhead", than "a good amount of code".
DPDK is being used in embedded and real time applications as well. There, processing a burst of packets is not possible due to low latency requirements. Hence it is not possible to amortize the cost.
Post by Stephen Hemminger
Post by Mattias Rönnblom
I must admit I didn't look at your library in detail, but I must still
ask: if TQS is basically RCU, why isn't it called RCU? And why isn't
the API calls named in a similar manner?
I kindly request you to take a look at the patch. More than that, if you have not done already, please take a look at the liburcu implementation as well.
TQS is not RCU (Read-Copy-Update). TQS helps implement RCU. TQS helps to understand when the threads have passed through the quiescent state.
I am also not sure why the name liburcu has RCU in it. It does not do any Read-Copy-Update.
Post by Stephen Hemminger
We used liburcu at Brocade with DPDK. It was just a case of putting
rcu_quiescent_state in the packet handling
loop. There were a bunch more cases where control thread needed to
register/unregister as part of RCU.
I assume that the packet handling loop was a polling loop (correct me if I am wrong). With the support of event dev, we have the rte_event_dequeue_burst API, which supports blocking till the packets are available (or blocking for an extended period of time). This means that, before calling this API, the thread needs to inform "don't worry about me". Once this API returns, it needs to inform "worry about me". So, these two APIs need to be efficient. Please look at the rte_tqs_register/unregister APIs.
Post by Stephen Hemminger
I think any library would have that issue with user supplied threads. You need
a "worry about me" and
a "don't worry about me" API in the library.
There is also a tradeoff between call_rcu and defer_rcu about what context
the RCU callback happens.
You really need a control thread to handle the RCU cleanup.
That is if you choose to use liburcu. rte_tqs provides the ability to do cleanup efficiently without the need for a control plane thread in DPDK use cases.
Post by Stephen Hemminger
The point is that RCU steps into the application design, and liburcu seems to
be flexible enough
and well documented enough to allow for more options.
Agree that RCU steps into application design. That is the reason rte_tqs just does enough and provides the flexibility to the application to implement the RCU however it feels like. DPDK has also stepped into application design by providing libraries like hash, LPM etc.

I do not understand why you think liburcu is flexible enough for DPDK use cases. I mentioned the specific use cases where liburcu is not useful. I did not find anything in the documentation to help me solve these use cases. Appreciate if you could help me understand.
Honnappa Nagarahalli
2018-11-30 02:25:45 UTC
Post by Van Haaren, Harry
Post by Stephen Hemminger
Mixed feelings about this one.
Love to see RCU used for more things since it is much better than
reader/writer locks for many applications. But hate to see DPDK
reinventing every other library and not reusing code. Userspace RCU
https://liburcu.org/ is widely supported by distros, more thoroughly
tested and documented, and more flexible.
The issue with many of these reinventions is a tradeoff of DPDK
growing another dependency on external library versus using common code.
For RCU, the big issue for me is the testing and documentation of how
to do RCU safely. Many people get it wrong!
Some notes on liburcu (and my amateur understanding of LGPL, I'm not a license lawyer :)
Liburcu is LGPL, which AFAIK means we must dynamically link applications if
the application code is BSD or other permissive licenses.
The side effect of this is that urcu function calls must be "real" function calls
and inlining them is not possible. Therefore using liburcu in LGPL mode could
have a performance impact in this case. I expect estimating the
performance cost would be difficult, as it's pretty much case-by-case
depending on what you're doing in the surrounding code.
Generally I'm in favour of using established libraries (particularly for complex
functionality like RCU) but in this case I think there's a tradeoff with raw
performance.
The licensing info [1] is very interesting. Again I am no lawyer :)

[1] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-qsbr.h#L184
Mattias Rönnblom
2018-11-30 21:03:37 UTC
Post by Honnappa Nagarahalli
Post by Van Haaren, Harry
Generally I'm in favour of using established libraries (particularly for complex
functionality like RCU) but in this case I think there's a tradeoff with raw
performance.
The licensing info [1] is very interesting. Again I am no lawyer :)
[1] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-qsbr.h#L184
If you don't know the macro/inline function exception of LGPL 2.1, maybe
it's time to read the license text. Lawyer or not.