Discussion:
[PATCH RFC 0/9] socket filtering using nf_tables
(too old to reply)
Alexei Starovoitov
2014-03-11 17:59:51 UTC
Permalink
Hi!
The following patchset provides a socket filtering alternative to BPF
which allows you to define your filter using the nf_tables expressions.
Similarly to BPF, you can attach filters via setsockopt()
SO_ATTACH_NFT_FILTER. The filter that is passed to the kernel is
expression list (nested attribute)
expression element (nested attribute)
expression name (string)
expression data (nested attribute)
... specific attribute for this expression go here
This is similar to the netlink format of the nf_tables rules, so we
can re-use most of the infrastructure that we already have in userspace.
The kernel takes the TLV representation and translates it to the native
nf_tables representation.
The patches 1-3 have helped to generalize the existing socket filtering
infrastructure to allow pluging new socket filtering frameworks. Then,
patches 4-8 generalize the nf_tables code by move the neccessary nf_tables
expression and data initialization core infrastructure. Then, patch 9
provides the nf_tables socket filtering capabilities.
Patrick and I have been discussing for a while that part of this
generalisation works should also help to add support for providing a
replacement to the tc framework, so with the necessary work, nf_tables
may provide in the near future packet a single packet classification
framework for Linux.
I'm being curious here ;) as there's currently an ongoing effort on
netdev for Alexei's eBPF engine (part 1 at [1,2,3]), which addresses
shortcomings of current BPF and shall long term entirely replace the
current BPF engine code to let filters entirely run in eBPF resp.
eBPF's JIT engine, as I understand, which is also transparently usable
in cls_bpf for classification in tc w/o rewriting on a different filter
language. Performance figures have been posted/provided in [1] as well.
So the plan on your side would be to have an alternative to eBPF, or
build on top of it to reuse its in-kernel JIT compiler?
[1] http://patchwork.ozlabs.org/patch/328927/
[2] http://patchwork.ozlabs.org/patch/328926/
[3] http://patchwork.ozlabs.org/patch/328928/
http://people.netfilter.org/pablo/nft-sock-filter-test.c
I'm currently reusing the existing libnftnl interfaces, my plan is to
new interfaces in that library for easier and more simple filter
definition for socket filtering.
Note that the current nf_tables expression-set is also limited with
regards to BPF, but the infrastructure that we have can be easily
extended with new expressions.
Comments welcome!
Hi Pablo,

Could you share what performance you're getting when doing nft
filter equivalent to 'tcpdump port 22' ?
Meaning your filter needs to parse eth->proto, ip or ipv6 header and
check both ports. How will it compare with JITed bpf/ebpf ?

I was trying to go the other way: improve nft performance with ebpf.
10/40G links are way to fast for interpreters. imo JIT is the only way.

here are some comments about patches:
1/9:
- if (fp->bpf_func != sk_run_filter)
- module_free(NULL, fp->bpf_func);
+ if (fp->run_filter != sk_run_filter)
+ module_free(NULL, fp->run_filter);

David suggested that these comparisons in all jits are ugly.
I've fixed it in my patches. When they're in, you wouldn't need to
mess with this.

2/9:
- atomic_sub(sk_filter_size(fp->len), &sk->sk_omem_alloc);
+ atomic_sub(fp->size, &sk->sk_omem_alloc);

that's a big change in socket memory accounting.
We used to account for the whole sk_filter... now you're counting
filter size only.
Is it valid?

7/9:
whole nft_expr_autoload() looks scary from security point of view.
If I'm reading it correctly, the code will do request_module() based on
userspace request to attach filter?

9/9:
+ case SO_NFT_GET_FILTER:
+ len = sk_nft_get_filter(sk, (struct sock_filter __user
*)optval, len);
with my patches there was a concern regarding socket checkpoint/restore
and I had to preserve existing filter image to make sure it's not broken.
Could you please coordinate with Pavel and co to test this piece?

What will happen if nft_filter attached, but so_get_filter is called? crash?

+static int nft_sock_expr_autoload(const struct nft_ctx *ctx,
+ const struct nlattr *nla)
+{
+#ifdef CONFIG_MODULES
+ mutex_unlock(&nft_expr_info_mutex);
+ request_module("nft-expr-%.*s", nla_len(nla), (char *)nla_data(nla));
+ mutex_lock(&nft_expr_info_mutex);

same security concern here...

+int sk_nft_attach_filter(char __user *optval, struct sock *sk)
+{

what about sk_clone_lock()? since filter program is in nft, do you need to do
special steps during copy of socket?

+ fp = sock_kmalloc(sk, sizeof(struct sk_filter) + size, GFP_KERNEL);

this may allocate more memory then you need.
Currently sk_filter_size() computes it in an accurate way.

Also the same issue of optmem accounting as I mentioned in 2/9

+err4:
+ sock_kfree_s(sk, fp, size);

a small bug: allocated sizeof(sk_filter)+size, but freeing 'size' only...

Overall I think it's very interesting work.
Not sure what's the use case for it though.

I'll cook up a patch for the opposite approach (use ebpf inside nft)
and will send you for review.
I would prefer to work together to satisfy your and our user requests.

Thanks
Alexei
net: rename fp->bpf_func to fp->run_filter
net: filter: account filter length in bytes
net: filter: generalise sk_filter_release
netfilter: nf_tables: move fast operations to header
netfilter: nf_tables: add nft_value_init
netfilter: nf_tables: rename nf_tables_core.c to nf_tables_nf.c
netfilter: nf_tables: move expression infrastructure to built-in core
netfilter: nf_tables: generalize verdict handling and introduce scopes
netfilter: nf_tables: add support for socket filtering
arch/arm/net/bpf_jit_32.c | 25 +-
arch/powerpc/net/bpf_jit_comp.c | 10 +-
arch/s390/net/bpf_jit_comp.c | 16 +-
arch/sparc/net/bpf_jit_comp.c | 8 +-
arch/x86/net/bpf_jit_comp.c | 8 +-
include/linux/filter.h | 28 +-
include/net/netfilter/nf_tables.h | 27 +-
include/net/netfilter/nf_tables_core.h | 84 +++++
include/net/netfilter/nft_reject.h | 3 +-
include/net/sock.h | 8 +-
include/uapi/asm-generic/socket.h | 4 +
net/core/filter.c | 28 +-
net/core/sock.c | 19 ++
net/core/sock_diag.c | 4 +-
net/netfilter/Kconfig | 13 +
net/netfilter/Makefile | 9 +-
net/netfilter/nf_tables_api.c | 440 ++++---------------------
net/netfilter/nf_tables_core.c | 564
+++++++++++++++++++++-----------
net/netfilter/nf_tables_nf.c | 189 +++++++++++
net/netfilter/nf_tables_sock.c | 327 ++++++++++++++++++
net/netfilter/nft_bitwise.c | 35 +-
net/netfilter/nft_byteorder.c | 28 +-
net/netfilter/nft_cmp.c | 43 ++-
net/netfilter/nft_compat.c | 6 +-
net/netfilter/nft_counter.c | 3 +-
net/netfilter/nft_ct.c | 9 +-
net/netfilter/nft_exthdr.c | 3 +-
net/netfilter/nft_hash.c | 12 +-
net/netfilter/nft_immediate.c | 35 +-
net/netfilter/nft_limit.c | 3 +-
net/netfilter/nft_log.c | 3 +-
net/netfilter/nft_lookup.c | 3 +-
net/netfilter/nft_meta.c | 51 ++-
net/netfilter/nft_nat.c | 3 +-
net/netfilter/nft_payload.c | 29 +-
net/netfilter/nft_queue.c | 3 +-
net/netfilter/nft_rbtree.c | 12 +-
net/netfilter/nft_reject.c | 3 +-
38 files changed, 1416 insertions(+), 682 deletions(-)
create mode 100644 net/netfilter/nf_tables_nf.c
create mode 100644 net/netfilter/nf_tables_sock.c
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Pablo Neira Ayuso
2014-03-12 09:15:27 UTC
Permalink
Hi!

I'm going to reply to Daniel and you in the same email, see below.
Post by Alexei Starovoitov
Hi!
The following patchset provides a socket filtering alternative to BPF
which allows you to define your filter using the nf_tables expressions.
Similarly to BPF, you can attach filters via setsockopt()
SO_ATTACH_NFT_FILTER. The filter that is passed to the kernel is
expression list (nested attribute)
expression element (nested attribute)
expression name (string)
expression data (nested attribute)
... specific attribute for this expression go here
This is similar to the netlink format of the nf_tables rules, so we
can re-use most of the infrastructure that we already have in userspace.
The kernel takes the TLV representation and translates it to the native
nf_tables representation.
The patches 1-3 have helped to generalize the existing socket filtering
infrastructure to allow pluging new socket filtering frameworks. Then,
patches 4-8 generalize the nf_tables code by move the neccessary nf_tables
expression and data initialization core infrastructure. Then, patch 9
provides the nf_tables socket filtering capabilities.
Patrick and I have been discussing for a while that part of this
generalisation works should also help to add support for providing a
replacement to the tc framework, so with the necessary work, nf_tables
may provide in the near future packet a single packet classification
framework for Linux.
I'm being curious here ;) as there's currently an ongoing effort on
netdev for Alexei's eBPF engine (part 1 at [1,2,3]), which addresses
shortcomings of current BPF and shall long term entirely replace the
current BPF engine code to let filters entirely run in eBPF resp.
eBPF's JIT engine, as I understand, which is also transparently usable
in cls_bpf for classification in tc w/o rewriting on a different filter
language. Performance figures have been posted/provided in [1] as well.
So the plan on your side would be to have an alternative to eBPF, or
build on top of it to reuse its in-kernel JIT compiler?
[1] http://patchwork.ozlabs.org/patch/328927/
struct sk_filter
{
atomic_t refcnt;
- unsigned int len; /* Number of filter blocks */
+ /* len - number of insns in sock_filter program
+ * len_ext - number of insns in socket_filter_ext program
+ * jited - true if either original or extended program was
JITed
+ * orig_prog - original sock_filter program if not NULL
+ */
+ unsigned int len;
+ unsigned int len_ext;
+ unsigned int jited:1;
+ struct sock_filter *orig_prog;
struct rcu_head rcu;
- unsigned int (*bpf_func)(const struct sk_buff *skb,
- const struct sock_filter
*filter);
+ union {
+ unsigned int (*bpf_func)(const struct sk_buff *skb,
+ const struct sock_filter *fp);
+ unsigned int (*bpf_func_ext)(const struct sk_buff *skb,
+ const struct sock_filter_ext *fp);
+ };
union {
struct sock_filter insns[0];
+ struct sock_filter_ext insns_ext[0];
struct work_struct work;
};
};

I think we have to generalise this to make it flexible to accomodate
any socket filtering infrastructure. For example, instead of having
bpf_func and bpf_func_ext, I think it would be good to generalise it
that so we pass some void *filter. I also think that other private
information to the filtering approach should be put after the
filtering code, in some variable length area.

This change looks quite ad-hoc. My 1-3 patches were more going to the
direction of making this in some generic way to accomodate any socket
filtering infrastructure.
Post by Alexei Starovoitov
[2] http://patchwork.ozlabs.org/patch/328926/
[3] http://patchwork.ozlabs.org/patch/328928/
http://people.netfilter.org/pablo/nft-sock-filter-test.c
I'm currently reusing the existing libnftnl interfaces, my plan is to
new interfaces in that library for easier and more simple filter
definition for socket filtering.
Note that the current nf_tables expression-set is also limited with
regards to BPF, but the infrastructure that we have can be easily
extended with new expressions.
Comments welcome!
Hi Pablo,
Could you share what performance you're getting when doing nft
filter equivalent to 'tcpdump port 22' ?
Meaning your filter needs to parse eth->proto, ip or ipv6 header and
check both ports. How will it compare with JITed bpf/ebpf ?
We already have plans to add jit to nf_tables.
Post by Alexei Starovoitov
I was trying to go the other way: improve nft performance with ebpf.
10/40G links are way to fast for interpreters. imo JIT is the only way.
- if (fp->bpf_func != sk_run_filter)
- module_free(NULL, fp->bpf_func);
+ if (fp->run_filter != sk_run_filter)
+ module_free(NULL, fp->run_filter);
David suggested that these comparisons in all jits are ugly.
I've fixed it in my patches. When they're in, you wouldn't need to
mess with this.
I see you have added fp->jited for this. I think we can make this more
generic if we have some enum so fp->type will tell us what kind of
filter we have, ie. bpf, bpf-jitted, nft, etc.
Post by Alexei Starovoitov
- atomic_sub(sk_filter_size(fp->len), &sk->sk_omem_alloc);
+ atomic_sub(fp->size, &sk->sk_omem_alloc);
that's a big change in socket memory accounting.
We used to account for the whole sk_filter... now you're counting
filter size only.
Is it valid?
We need this change for if nf_tables. We don't use a fixed layout for
each instruction of the filter like you do, I mean:

+struct sock_filter_ext {
+ __u8 code; /* opcode */
+ __u8 a_reg:4; /* dest register */
+ __u8 x_reg:4; /* source register */
+ __s16 off; /* signed offset */
+ __s32 imm; /* signed immediate constant */
+};

If I didn't miss anything from your patchset, that structure is again
exposed to userspace, so we won't be able to change it ever unless we
add a new binary interface.

From the user interface perspective, our nft approach is quite
different, from userspace you express the filter using TLVs (like in
the payload of the netlink message). Then, nf_tables core
infrastructure parses this and transforms it to the kernel
representation. It's very flexible since you don't expose the internal
representation to userspace, which means that we can change the
internal layout anytime, thus, highly extensible.

BTW, how do you make when you don't need the imm field? You just leave
it unset I guess. And how does the code to compare an IPv6 address
looks like?
Post by Alexei Starovoitov
whole nft_expr_autoload() looks scary from security point of view.
If I'm reading it correctly, the code will do request_module() based on
userspace request to attach filter?
Only root can invoke that code so far.
Post by Alexei Starovoitov
+ len = sk_nft_get_filter(sk, (struct sock_filter __user
*)optval, len);
with my patches there was a concern regarding socket checkpoint/restore
and I had to preserve existing filter image to make sure it's not broken.
Could you please coordinate with Pavel and co to test this piece?
What will happen if nft_filter attached, but so_get_filter is called? crash?
I was trying to avoid to add some enum to identify to sk_filter to
identify what kind of filter we have there, but this is needed
indeed. Thanks for spotting this.
Post by Alexei Starovoitov
+static int nft_sock_expr_autoload(const struct nft_ctx *ctx,
+ const struct nlattr *nla)
+{
+#ifdef CONFIG_MODULES
+ mutex_unlock(&nft_expr_info_mutex);
+ request_module("nft-expr-%.*s", nla_len(nla), (char *)nla_data(nla));
+ mutex_lock(&nft_expr_info_mutex);
same security concern here...
+int sk_nft_attach_filter(char __user *optval, struct sock *sk)
+{
what about sk_clone_lock()? since filter program is in nft, do you need to do
special steps during copy of socket?
+ fp = sock_kmalloc(sk, sizeof(struct sk_filter) + size, GFP_KERNEL);
this may allocate more memory then you need.
Currently sk_filter_size() computes it in an accurate way.
That won't work for nf_tables, we don't necessarily have to stick to a
fixed layout per instruction like you do, so the real filter size is
obtained after parsing the TLVs that was passed from userspace.
Post by Alexei Starovoitov
Also the same issue of optmem accounting as I mentioned in 2/9
+ sock_kfree_s(sk, fp, size);
a small bug: allocated sizeof(sk_filter)+size, but freeing 'size' only...
Good catch, I'll fix it, thanks.
Post by Alexei Starovoitov
Overall I think it's very interesting work.
Not sure what's the use case for it though.
I'll cook up a patch for the opposite approach (use ebpf inside nft)
and will send you for review.
I don't like the idea of sticking to some fixed layout structure to
represent the filter, please hold on with it.
Post by Alexei Starovoitov
I would prefer to work together to satisfy your and our user requests.
That would be nice, but so far I think that we have different
approaches from user interface perspective and, thus, the kernel
design for it.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Pablo Neira Ayuso
2014-03-12 09:27:23 UTC
Permalink
Post by Pablo Neira Ayuso
Post by Alexei Starovoitov
whole nft_expr_autoload() looks scary from security point of view.
If I'm reading it correctly, the code will do request_module() based on
userspace request to attach filter?
Only root can invoke that code so far.
Oops, this is obviously wrong. This request_module part needs a fix
indeed for the non-root part.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Alexei Starovoitov
2014-03-13 03:29:55 UTC
Permalink
Post by Pablo Neira Ayuso
Hi!
I'm going to reply to Daniel and you in the same email, see below.
struct sk_filter
{
atomic_t refcnt;
- unsigned int len; /* Number of filter blocks */
+ /* len - number of insns in sock_filter program
+ * len_ext - number of insns in socket_filter_ext program
+ * jited - true if either original or extended program was
JITed
+ * orig_prog - original sock_filter program if not NULL
+ */
+ unsigned int len;
+ unsigned int len_ext;
+ unsigned int jited:1;
+ struct sock_filter *orig_prog;
struct rcu_head rcu;
- unsigned int (*bpf_func)(const struct sk_buff *skb,
- const struct sock_filter
*filter);
+ union {
+ unsigned int (*bpf_func)(const struct sk_buff *skb,
+ const struct sock_filter *fp);
+ unsigned int (*bpf_func_ext)(const struct sk_buff *skb,
+ const struct sock_filter_ext *fp);
+ };
union {
struct sock_filter insns[0];
+ struct sock_filter_ext insns_ext[0];
struct work_struct work;
};
};
I think we have to generalise this to make it flexible to accomodate
any socket filtering infrastructure. For example, instead of having
bpf_func and bpf_func_ext, I think it would be good to generalise it
that so we pass some void *filter. I also think that other private
well, David indicated that using 'void*' for such cases is undesirable,
since we want to rely on compiler to do type verification as much
as we can. My patches are preserving type safety.
'void * filter' would mean - open the door for anything.
I don't think we want that type of 'generality'.
Post by Pablo Neira Ayuso
information to the filtering approach should be put after the
filtering code, in some variable length area.
This change looks quite ad-hoc. My 1-3 patches were more going to the
direction of making this in some generic way to accomodate any socket
filtering infrastructure.
They may look ad-hoc, but they're preserving type checking.
Post by Pablo Neira Ayuso
Post by Alexei Starovoitov
Could you share what performance you're getting when doing nft
filter equivalent to 'tcpdump port 22' ?
Meaning your filter needs to parse eth->proto, ip or ipv6 header and
check both ports. How will it compare with JITed bpf/ebpf ?
We already have plans to add jit to nf_tables.
The patches don't explain the reasons to do nft socket filtering.
I can only guess and my guess that this is to show
that nft sort of can do what bpf can.
tc can be made to do socket filtering too.
The differentiation is speed and ease of use.
Both have big question marks in sock_filter+nft approach.

I think to consider nft to be one and only classifier, some
benchmarking needs to be done first:
nft vs bpf, nft vs tc, nft vs ovs, ...

It can be done the other way too. nft can run on top of tc.
ovs can run on top of tc and so on.
I'm not advocating any of that.

Having one interpreter underneath doesn't mean that all
components will be easier to maintain or have less code around.
Code that converts from one to another counts as well.
Simplicity and performance should be the deciding factor.
imo nft+sock_filter example is not simple.

I've posted patches to compile restricted C into ebpf.
Theoretically I can make 'universal kernel module' out of ebpf.
Like, compile any C code into ebpf and jit it.
Such 'universal kernel module' will be runnable on all architectures.
One compiler to rule them all... one ebpf to run them all... NO!
That may be cool thing for university research, but no good
reason to do it in practice.
Same way nft for socket filtering is a cool research, but what
is the strong reason to have it in kernel and maintain forever?
Post by Pablo Neira Ayuso
Post by Alexei Starovoitov
- if (fp->bpf_func != sk_run_filter)
- module_free(NULL, fp->bpf_func);
+ if (fp->run_filter != sk_run_filter)
+ module_free(NULL, fp->run_filter);
David suggested that these comparisons in all jits are ugly.
I've fixed it in my patches. When they're in, you wouldn't need to
mess with this.
I see you have added fp->jited for this. I think we can make this more
generic if we have some enum so fp->type will tell us what kind of
filter we have, ie. bpf, bpf-jitted, nft, etc.
Such enum will have a problem of explosion when flags
start to cross-multiply.
fp->jited flag just says jitted or not. Easier to check.
Post by Pablo Neira Ayuso
Post by Alexei Starovoitov
- atomic_sub(sk_filter_size(fp->len), &sk->sk_omem_alloc);
+ atomic_sub(fp->size, &sk->sk_omem_alloc);
that's a big change in socket memory accounting.
We used to account for the whole sk_filter... now you're counting
filter size only.
Is it valid?
We need this change for if nf_tables. We don't use a fixed layout for
I think you missed the point.
Allocated sockopt memory is not counted properly.
It's not fixed vs non-fixed. sizeof(struct sk_filter) needs to
be accounted too, since it was allocated.
Post by Pablo Neira Ayuso
+struct sock_filter_ext {
+ __u8 code; /* opcode */
+ __u8 a_reg:4; /* dest register */
+ __u8 x_reg:4; /* source register */
+ __s16 off; /* signed offset */
+ __s32 imm; /* signed immediate constant */
+};
If I didn't miss anything from your patchset, that structure is again
exposed to userspace, so we won't be able to change it ever unless we
add a new binary interface.
From the user interface perspective, our nft approach is quite
different, from userspace you express the filter using TLVs (like in
the payload of the netlink message). Then, nf_tables core
infrastructure parses this and transforms it to the kernel
representation. It's very flexible since you don't expose the internal
representation to userspace, which means that we can change the
internal layout anytime, thus, highly extensible.
bpf/ebpf is an instruction set. Arguing about fixed vs non-fixed
insn size is like arguing x86 variable encoding vs sparc fixed.
Both are infinitely flexible.
ebpf is a low level insn set with defined calling convention, so
ebpf program can call fixed set of kernel functions if ebpf checker
allows that.
In sockfilters+ebpf, seccomp+ebpf, ovs+ebpf and tracing+ebpf
I've already demonstrated that ebpf instruction set, its
interpreter and its JIT are staying the same, while all are doing
very different things. Cannot think of better extensibility.

Extensibility is not with new instructions. We don't add new
instructions to x86 just to do a new feature.
New instructions are added to CPUs to make feature faster.
Like aes crypto can be done with normal x86 insns and it can
be done with aseni intel extensions. Same way aes can be done
with ebpf and ebpf doesn't need new instructions to do that.
Post by Pablo Neira Ayuso
BTW, how do you make when you don't need the imm field? You just leave
it unset I guess. And how does the code to compare an IPv6 address
looks like?
how x86 or sparc instruction set handles ipv6 addresses
without 128-bit registers? ebpf does the same.

nft design picked 128 bit registers because of ipv6 addresses?
That makes it difficult to jit.
Post by Pablo Neira Ayuso
Post by Alexei Starovoitov
whole nft_expr_autoload() looks scary from security point of view.
If I'm reading it correctly, the code will do request_module() based on
userspace request to attach filter?
Only root can invoke that code so far.
If we want to allow non-root access the whole nft needs to
be scrutinized.
Post by Pablo Neira Ayuso
Post by Alexei Starovoitov
+ fp = sock_kmalloc(sk, sizeof(struct sk_filter) + size, GFP_KERNEL);
this may allocate more memory then you need.
Currently sk_filter_size() computes it in an accurate way.
That won't work for nf_tables, we don't necessarily have to stick to a
fixed layout per instruction like you do, so the real filter size is
obtained after parsing the TLVs that was passed from userspace.
That sock_malloc is allocating sizeof(sk_filter) + size.
Meaning that it allocated unnecessary sizeof(work_struct) bytes
and not accounting for them and for sk_filter itself in filter charge/uncharge
Post by Pablo Neira Ayuso
Post by Alexei Starovoitov
Also the same issue of optmem accounting as I mentioned in 2/9
+ sock_kfree_s(sk, fp, size);
a small bug: allocated sizeof(sk_filter)+size, but freeing 'size' only...
Good catch, I'll fix it, thanks.
Post by Alexei Starovoitov
Overall I think it's very interesting work.
Not sure what's the use case for it though.
I'll cook up a patch for the opposite approach (use ebpf inside nft)
and will send you for review.
I don't like the idea of sticking to some fixed layout structure to
represent the filter, please hold on with it.
Post by Alexei Starovoitov
I would prefer to work together to satisfy your and our user requests.
That would be nice, but so far I think that we have different
approaches from user interface perspective and, thus, the kernel
design for it.
Since you're planning to do nft-jit, you'd need to generate bits and
bytes for different architectures with their own quirks.
Don't you want to offload this work to somebody who already did this
and who understand quirks of different architectures?
Now imagine that you can generate some intermediate representation
that maps one to one to x86 and other architectures.
Intermediate representation is a pseudo assembler code.
Say you want to translate nft-cmp instruction into sequence of native
comparisons. You'd need to use load from memory, compare and
branch operations. That's ebpf!
Design of ebpf is such that every ebpf insn maps to one native insn.
You can view ebpf as a tool to achieve jiting of nft.
It will save you a lot of time.

Thanks
Alexei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Pablo Neira Ayuso
2014-03-13 12:30:09 UTC
Permalink
[...]
Post by Alexei Starovoitov
The patches don't explain the reasons to do nft socket filtering.
OK, some reasons from the interface point of view:

1) It provides an extensible interface to userspace. We didn't
invented a new wheel in that regard, we just reused the
extensibility of TLVs used in netlink as intermediate format
between user and kernelspace, also used many other applications
outthere. The TLV parsing and building is not new code, most that of
that code has been exposed to userspace already through netlink.

2) It shows that, with little generalisation, we can open the door to
one single *classification interface* for the users. Just make some
little googling, you'll find *lots of people* barfing on the fact that
we have that many interfaces to classify packets in Linux. And I'm
*not* talking about the packet classification approach itself, that's a
different debate of course.

[...]
Post by Alexei Starovoitov
Simplicity and performance should be the deciding factor.
imo nft+sock_filter example is not simple.
OK, some comments in that regard:

1) Simplicity: With the nft approach you can just use a filter
expressed in json, eg.

{"rule":
{"expr":[
{"type":"meta","dreg":1,"key":"protocol"},
{"type":"cmp","sreg":1,"op":"eq","cmpdata":{"data_reg":{"type":"value","len":2,"data0":"0x00000008"}}},
{"type":"payload","dreg":1,"offset":9,"len":1,"base":"network"},
{"type":"cmp","sreg":1,"op":"eq","cmpdata":{"data_reg":{"type":"value","len":1,"data0":"0x00000006"}}},
{"type":"immediate","dreg":0,"immediatedata":{"data_reg":{"type":"verdict","verdict":"0xffffffff"}}}]
}
}

Which is still more readable and easier to maintain that a BPF
snippet. So you just pass it to the json parser in the libnftnl
library (or whatever other better minimalistic library we would ever
have) which transforms this to TLV format that you can pass to the
kernel. The kernel will do the job to translate this.

How are users going to integrate the restricted C's eBPF code
infrastructure into their projects? I don't think that will be simple.
They will likely include the BPF snippet to avoid all the integration
burden, as it already happens in many projects with BPF filters.

2) Performance. Patrick has been doing many progress with the generic
set infrastructure for nft. In that regard, we aim to achieve
performance by arranging data using performance data structures,
*jit is not the only way to boost performance* (although we also
want to have it).

Some example:

set type IPv4 address = { N thousands of IPv4 addresses }

reg1 <- payload(network header, offsetof(struct iphdr, daddr), 4)
lookup(reg1, set)
reg0 <- immediate(0xffffffff)

Or even better, using dictionaries:

set type IPv4 address = { 1.1.1.1 : accept, 2.2.2.2 : accept, 3.3.3.3 : drop ...}

reg1 <- payload(network header, offsetof(struct iphdr, daddr), 4)
reg0 <- lookup(reg1, set)

Where "accept" is an alias of 0xffffffff and "drop" is 0 in the
nft-sock case. The verdict part has been generalised so we can adapt
nft to the corresponding packet classification engine.

This part just needs a follow-up patch to add set-based filtering for
nft-sockets. See this for more information about ongoing efforts on
the nft set infrastructure: http://patchwork.ozlabs.org/patch/326368/
that will also integrate into nft-sock.

[...]
Post by Alexei Starovoitov
Post by Pablo Neira Ayuso
+struct sock_filter_ext {
+ __u8 code; /* opcode */
+ __u8 a_reg:4; /* dest register */
+ __u8 x_reg:4; /* source register */
+ __s16 off; /* signed offset */
+ __s32 imm; /* signed immediate constant */
+};
If I didn't miss anything from your patchset, that structure is again
exposed to userspace, so we won't be able to change it ever unless we
add a new binary interface.
From the user interface perspective, our nft approach is quite
different, from userspace you express the filter using TLVs (like in
the payload of the netlink message). Then, nf_tables core
infrastructure parses this and transforms it to the kernel
representation. It's very flexible since you don't expose the internal
representation to userspace, which means that we can change the
internal layout anytime, thus, highly extensible.
bpf/ebpf is an instruction set. Arguing about fixed vs non-fixed
insn size is like arguing x86 variable encoding vs sparc fixed.
Both are infinitely flexible.
Right, you can extend interfaces forever with lots of patchwork and
"smart tricks" but that doesn't mean that will look nice...

As I said, I believe that having a nice extensible interface is
extremely important to make it easier for development. If we have to
rearrange the internal representation for some reason, we can do it
indeed without bothering about making translations to avoid breaking
userspace and having to use ugly tricks (just see sk_decode_filter()
or any other translation to support any new way to express a
filter...).

[...]
Post by Alexei Starovoitov
Say you want to translate nft-cmp instruction into sequence of native
comparisons. You'd need to use load from memory, compare and
branch operations. That's ebpf!
Nope sorry, that's not ebpf. That's assembler code.

[...]
Post by Alexei Starovoitov
You can view ebpf as a tool to achieve jiting of nft.
It will save you a lot of time.
nft interface is already well-abstracted from the representation, so I
don't find a good reason to make a step backward that will force us to
represent the instructions using a fixed layout structure that is
exposed to userspace, that we won't be able to change once if this
gets into mainstream.

Probably the nft is not the easiest path, I agree, it's been not so
far if you look at the record. But with time and development hours
from everyone, I believe we'll enjoy a nice framework.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Alexei Starovoitov
2014-03-14 15:28:16 UTC
Permalink
Post by Pablo Neira Ayuso
[...]
It seems you're assuming that ebpf inherited all the shortcomings
of bpf and making conclusion based on that. Not your fault.
I didn't explain it well enough.
Technically ebpf is a small evolution of bpf, but applicability made a
giant leap. I cannot compile C into bpf, but I can do that with ebpf.
I cannot do table lookups in bpf, but I can do that in ebpf.
I cannot rewrite packet headers in bpf, but I can do that in ebpf, etc.
Post by Pablo Neira Ayuso
Post by Alexei Starovoitov
The patches don't explain the reasons to do nft socket filtering.
1) It provides an extensible interface to userspace. We didn't
invented a new wheel in that regard, we just reused the
extensibility of TLVs used in netlink as intermediate format
between user and kernelspace, also used many other applications
outthere. The TLV parsing and building is not new code, most that of
that code has been exposed to userspace already through netlink.
2) It shows that, with little generalisation, we can open the door to
one single *classification interface* for the users. Just make some
little googling, you'll find *lots of people* barfing on the fact that
we have that many interfaces to classify packets in Linux. And I'm
*not* talking about the packet classification approach itself, that's a
different debate of course.
[...]
Post by Alexei Starovoitov
Simplicity and performance should be the deciding factor.
imo nft+sock_filter example is not simple.
1) Simplicity: With the nft approach you can just use a filter
expressed in json, eg.
{"expr":[
{"type":"meta","dreg":1,"key":"protocol"},
{"type":"cmp","sreg":1,"op":"eq","cmpdata":{"data_reg":{"type":"value","len":2,"data0":"0x00000008"}}},
{"type":"payload","dreg":1,"offset":9,"len":1,"base":"network"},
{"type":"cmp","sreg":1,"op":"eq","cmpdata":{"data_reg":{"type":"value","len":1,"data0":"0x00000006"}}},
{"type":"immediate","dreg":0,"immediatedata":{"data_reg":{"type":"verdict","verdict":"0xffffffff"}}}]
}
}
sorry. It surely looks simple to you, but I cannot figure out what
the above snippet suppose to do. Could you please explain.
Post by Pablo Neira Ayuso
Which is still more readable and easier to maintain that a BPF
snippet. So you just pass it to the json parser in the libnftnl
library (or whatever other better minimalistic library we would ever
have) which transforms this to TLV format that you can pass to the
kernel. The kernel will do the job to translate this.
How are users going to integrate the restricted C's eBPF code
infrastructure into their projects? I don't think that will be simple.
They will likely include the BPF snippet to avoid all the integration
burden, as it already happens in many projects with BPF filters.
what you're seeing is the counter argument to your 'bpf not-simple'
statement :) If bpf snippets were hard to understand, people
wouldn't be including them as-is in their programs.
One can always do 'tcpdump expression -d' to generate bpf snippet
or use libpcap in their programs to dynamically generate them.
libseccomp dynamically generates bpf too.
Post by Pablo Neira Ayuso
2) Performance. Patrick has been doing many progress with the generic
set infrastructure for nft. In that regard, we aim to achieve
performance by arranging data using performance data structures,
*jit is not the only way to boost performance* (although we also
want to have it).
set type IPv4 address = { N thousands of IPv4 addresses }
reg1 <- payload(network header, offsetof(struct iphdr, daddr), 4)
lookup(reg1, set)
reg0 <- immediate(0xffffffff)
set type IPv4 address = { 1.1.1.1 : accept, 2.2.2.2 : accept, 3.3.3.3 : drop ...}
reg1 <- payload(network header, offsetof(struct iphdr, daddr), 4)
reg0 <- lookup(reg1, set)
Where "accept" is an alias of 0xffffffff and "drop" is 0 in the
nft-sock case. The verdict part has been generalised so we can adapt
nft to the corresponding packet classification engine.
that example is actually illustrates that nft needs to be continuously
tweaked to add features like 'set'. We can do better.

Here is the example from V2 series that shows how hash tables can be
used in C that translates to ebpf, without changing ebpf itself:
void dropmon(struct kprobe_args *ctx)
{
void *loc;
uint64_t *drop_cnt;
/* skb:kfree_skb is defined as:
* TRACE_EVENT(kfree_skb,
* TP_PROTO(struct sk_buff *skb, void *location),
* so ctx->arg2 is 'location'
*/
loc = (void *)ctx->arg2;

drop_cnt = bpf_table_lookup(ctx, 0, &loc);
if (drop_cnt) {
__sync_fetch_and_add(drop_cnt, 1);
} else {
uint64_t init = 0;
bpf_table_update(ctx, 0, &loc, &init);
}
}
the above C program compiles into ebpf, attaches to kfree_skb() tracepoint
and counts packet drops at different locations.
userspace can read the table and print it in user friendly way while
the tracing filter is running or when it's stopped.
That's an example of fast drop monitor that is safe to insert into live kernel.

Actually I think it's wrong to even compare nft with ebpf.
ebpf doesn't dictate a new user interface. At all.
There is old bpf to write socket filters. It's good enough.
I'm not planning to hack libpcap just to generate ebpf.

User interface is orthogonal to kernel implementation.
We can argue whether C representation of filter is better than json,
but what kernel runs at the lowest level is independent of that.

Insisting that user interface and kernel representation must be
one-to-one is unnecessary restrictive. User interface can and
should evolve independently of what kernel is doing underneath.

In case of socket filters tcpdump/libpcap syntax is the user interface.
old bpf is a kernel-user api. I don't think there is a strong need
to change either. ebpf is not touching them, but helping to execute
tcpdump filters faster.
In case of tracing filters I propose C-like user interface.
Kernel API for translated C programs is a different matter.
ebpf in the kernel is just the engine to execute it.
1st and 2nd may look completely different after community feedback,
but in kernel ebpf engine can stay unmodified.
Post by Pablo Neira Ayuso
Right, you can extend interfaces forever with lots of patchwork and
"smart tricks" but that doesn't mean that will look nice...
I'm not sure what you mean here.
Post by Pablo Neira Ayuso
As I said, I believe that having a nice extensible interface is
extremely important to make it easier for development. If we have to
rearrange the internal representation for some reason, we can do it
indeed without bothering about making translations to avoid breaking
userspace and having to use ugly tricks (just see sk_decode_filter()
or any other translation to support any new way to express a
filter...).
nice that your brought this up :)
As I mentioned in v4 thread sk_decode_filter() can be removed.
It was introduced to improve old interpreter performance and now
this part is obsolete.
Post by Pablo Neira Ayuso
[...]
Post by Alexei Starovoitov
Say you want to translate nft-cmp instruction into sequence of native
comparisons. You'd need to use load from memory, compare and
branch operations. That's ebpf!
Nope sorry, that's not ebpf. That's assembler code.
Well, in my previous email I tried to explain that assembler == ebpf :)
Please post x86_64 assembler code that future nft-jit suppose to
generate and I can post equivalent ebpf code that will be jited
exactly to your x86_64...
Post by Pablo Neira Ayuso
[...]
Post by Alexei Starovoitov
You can view ebpf as a tool to achieve jiting of nft.
It will save you a lot of time.
nft interface is already well-abstracted from the representation, so I
don't find a good reason to make a step backward that will force us to
represent the instructions using a fixed layout structure that is
exposed to userspace, that we won't be able to change once if this
gets into mainstream.
I don't think we're on the same page still.
To make this more productive, please say what feature you would
want to see supported and I can show how it is done without
changing ebpf insn set.
Post by Pablo Neira Ayuso
Probably the nft is not the easiest path, I agree, it's been not so
far if you look at the record. But with time and development hours
from everyone, I believe we'll enjoy a nice framework.
No doubt that nftables is a nice framework. Let's keep it going
and let's make it faster.

Regards,
Alexei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Pablo Neira Ayuso
2014-03-14 18:16:19 UTC
Permalink
Post by Alexei Starovoitov
Post by Pablo Neira Ayuso
[...]
It seems you're assuming that ebpf inherited all the shortcomings
of bpf and making conclusion based on that. Not your fault.
I didn't explain it well enough.
Frankly, from the *interface point of view* it is indeed inheriting
all of its shortcomings...
Post by Alexei Starovoitov
Technically ebpf is a small evolution of bpf, but applicability made a
giant leap. I cannot compile C into bpf, but I can do that with ebpf.
Good that we can get better userspace tools, but still the layout of
your instructions is exposed to userspace so it cannot be changed
*ever*. The kernel interface is still an array of binary structures of
fixed size just like BPF does.

Why do we have to have a 32 bits immediate that may be zero most of
the time? Just because:

1) You needed to align your new instruction layout to 64 bits / two 32
bits words.
2) You noticed you get better performance with those changes,
including the new "if" statement that works better with branch
prediction.
3) It's easier for you to make the jit translation.

That means your interface is exposing *all of your internal
implementation decisions* and that's a very bad since we will always
have to come up with smart tricks not to break backward compatibility
if we want to improve the internal implementation.
Post by Alexei Starovoitov
I cannot do table lookups in bpf, but I can do that in ebpf.
I cannot rewrite packet headers in bpf, but I can do that in ebpf, etc.
Sorry, I don't buy this "we get more features" if in the long run we
have restricted extensibility.
Post by Alexei Starovoitov
Post by Pablo Neira Ayuso
Post by Alexei Starovoitov
The patches don't explain the reasons to do nft socket filtering.
1) It provides an extensible interface to userspace. We didn't
invented a new wheel in that regard, we just reused the
extensibility of TLVs used in netlink as intermediate format
between user and kernelspace, also used many other applications
outthere. The TLV parsing and building is not new code, most that of
that code has been exposed to userspace already through netlink.
2) It shows that, with little generalisation, we can open the door to
one single *classification interface* for the users. Just make some
little googling, you'll find *lots of people* barfing on the fact that
we have that many interfaces to classify packets in Linux. And I'm
*not* talking about the packet classification approach itself, that's a
different debate of course.
[...]
Post by Alexei Starovoitov
Simplicity and performance should be the deciding factor.
imo nft+sock_filter example is not simple.
1) Simplicity: With the nft approach you can just use a filter
expressed in json, eg.
{"expr":[
{"type":"meta","dreg":1,"key":"protocol"},
{"type":"cmp","sreg":1,"op":"eq","cmpdata":{"data_reg":{"type":"value","len":2,"data0":"0x00000008"}}},
{"type":"payload","dreg":1,"offset":9,"len":1,"base":"network"},
{"type":"cmp","sreg":1,"op":"eq","cmpdata":{"data_reg":{"type":"value","len":1,"data0":"0x00000006"}}},
{"type":"immediate","dreg":0,"immediatedata":{"data_reg":{"type":"verdict","verdict":"0xffffffff"}}}]
}
}
sorry. It surely looks simple to you, but I cannot figure out what
the above snippet suppose to do. Could you please explain.
1) Fetch skb->protocol put it into a register.
2) Compare it to ETHERPROTO_IP. If not matching, break.
3) Fetch payload byte offset 9, only 1 byte.
4) Compare it with IPPROTO_TCP. If not matching, break.
5) If matching, pass the packet to userspace.
Post by Alexei Starovoitov
Post by Pablo Neira Ayuso
Which is still more readable and easier to maintain that a BPF
snippet. So you just pass it to the json parser in the libnftnl
library (or whatever other better minimalistic library we would ever
have) which transforms this to TLV format that you can pass to the
kernel. The kernel will do the job to translate this.
How are users going to integrate the restricted C's eBPF code
infrastructure into their projects? I don't think that will be simple.
They will likely include the BPF snippet to avoid all the integration
burden, as it already happens in many projects with BPF filters.
what you're seeing is the counter argument to your 'bpf not-simple'
statement :) If bpf snippets were hard to understand, people
wouldn't be including them as-is in their programs.
No, we had no other choice, it was the best thing that we had so far.
People have been just including that "magic code", BPF is quite
unreadable, your extension doesn't make any better.
Post by Alexei Starovoitov
One can always do 'tcpdump expression -d' to generate bpf snippet
or use libpcap in their programs to dynamically generate them.
libseccomp dynamically generates bpf too.
Autogeneration tools are of course good to have, that can be achieved
with *any* framework. I don't think this is the main point of this
discussion.
Post by Alexei Starovoitov
Post by Pablo Neira Ayuso
2) Performance. Patrick has been doing many progress with the generic
set infrastructure for nft. In that regard, we aim to achieve
performance by arranging data using performance data structures,
*jit is not the only way to boost performance* (although we also
want to have it).
set type IPv4 address = { N thousands of IPv4 addresses }
reg1 <- payload(network header, offsetof(struct iphdr, daddr), 4)
lookup(reg1, set)
reg0 <- immediate(0xffffffff)
set type IPv4 address = { 1.1.1.1 : accept, 2.2.2.2 : accept, 3.3.3.3 : drop ...}
reg1 <- payload(network header, offsetof(struct iphdr, daddr), 4)
reg0 <- lookup(reg1, set)
Where "accept" is an alias of 0xffffffff and "drop" is 0 in the
nft-sock case. The verdict part has been generalised so we can adapt
nft to the corresponding packet classification engine.
that example is actually illustrates that nft needs to be continuously
tweaked to add features like 'set'. We can do better.
No, it just proofs that our framework is extensible and that we can
improve easily.
Post by Alexei Starovoitov
Here is the example from V2 series that shows how hash tables can be
void dropmon(struct kprobe_args *ctx)
{
void *loc;
uint64_t *drop_cnt;
* TRACE_EVENT(kfree_skb,
* TP_PROTO(struct sk_buff *skb, void *location),
* so ctx->arg2 is 'location'
*/
loc = (void *)ctx->arg2;
drop_cnt = bpf_table_lookup(ctx, 0, &loc);
Is there room to extend your framework with any other data structure
which is *not* a hashtable? What are you plans for that?
Post by Alexei Starovoitov
if (drop_cnt) {
__sync_fetch_and_add(drop_cnt, 1);
} else {
uint64_t init = 0;
bpf_table_update(ctx, 0, &loc, &init);
}
}
the above C program compiles into ebpf, attaches to kfree_skb() tracepoint
and counts packet drops at different locations.
userspace can read the table and print it in user friendly way while
the tracing filter is running or when it's stopped.
That's an example of fast drop monitor that is safe to insert into live kernel.
Actually I think it's wrong to even compare nft with ebpf.
ebpf doesn't dictate a new user interface. At all.
There is old bpf to write socket filters. It's good enough.
I'm not planning to hack libpcap just to generate ebpf.
User interface is orthogonal to kernel implementation.
We can argue whether C representation of filter is better than json,
but what kernel runs at the lowest level is independent of that.
Insisting that user interface and kernel representation must be
one-to-one is unnecessary restrictive. User interface can and
should evolve independently of what kernel is doing underneath.
In case of socket filters tcpdump/libpcap syntax is the user interface.
No, that's not the real interface. That's a wrapper / abstraction over
the kernel interface is exposing.
Post by Alexei Starovoitov
old bpf is a kernel-user api. I don't think there is a strong need
to change either. ebpf is not touching them, but helping to execute
tcpdump filters faster.
In case of tracing filters I propose C-like user interface.
Kernel API for translated C programs is a different matter.
ebpf in the kernel is just the engine to execute it.
1st and 2nd may look completely different after community feedback,
but in kernel ebpf engine can stay unmodified.
The only different that I see with ebpf is that you provide nice end
user tools, but the design from the kernel interface has exactly the
same problems.
Post by Alexei Starovoitov
Post by Pablo Neira Ayuso
Right, you can extend interfaces forever with lots of patchwork and
"smart tricks" but that doesn't mean that will look nice...
I'm not sure what you mean here.
For example, this:

diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index cf9cd13509a7..e1b979312588 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -273,6 +273,13 @@ static struct ctl_table net_core_table[] = {
},
#endif
{
+ .procname = "bpf_ext_enable",
+ .data = &bpf_ext_enable,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec
+ },

This /proc thing seems to me like the last resource we have to avoid
breaking backward compatibility. I have used it as well myself, but
it's like the *pull alarm* when we did a wrong design decision.

What if we need a new ABI breakage for BPF again? Will we need to add
a new /proc interface for that? As far as I can tell from your
patches, the answer is yes.
Post by Alexei Starovoitov
Post by Pablo Neira Ayuso
As I said, I believe that having a nice extensible interface is
extremely important to make it easier for development. If we have to
rearrange the internal representation for some reason, we can do it
indeed without bothering about making translations to avoid breaking
userspace and having to use ugly tricks (just see sk_decode_filter()
or any other translation to support any new way to express a
filter...).
nice that your brought this up :)
As I mentioned in v4 thread sk_decode_filter() can be removed.
It was introduced to improve old interpreter performance and now
this part is obsolete.
What are your plans then? Will you implement that converter in
userspace? You mention that you don't want to enhance libpcap, which
seems to me like the natural way to extend things.
Post by Alexei Starovoitov
Post by Pablo Neira Ayuso
[...]
Post by Alexei Starovoitov
Say you want to translate nft-cmp instruction into sequence of native
comparisons. You'd need to use load from memory, compare and
branch operations. That's ebpf!
Nope sorry, that's not ebpf. That's assembler code.
Well, in my previous email I tried to explain that assembler == ebpf :)
I see, so I was right. You want to expose a pseudo-assembler
interface just because that makes it easier to you to provide the jit
translation.
Post by Alexei Starovoitov
Please post x86_64 assembler code that future nft-jit suppose to
generate and I can post equivalent ebpf code that will be jited
exactly to your x86_64...
That's possible of course. There are many ways to implement the same
thing, they can provide the same features, but not the same degree of
extensibility.
Post by Alexei Starovoitov
Post by Pablo Neira Ayuso
[...]
Post by Alexei Starovoitov
You can view ebpf as a tool to achieve jiting of nft.
It will save you a lot of time.
nft interface is already well-abstracted from the representation, so I
don't find a good reason to make a step backward that will force us to
represent the instructions using a fixed layout structure that is
exposed to userspace, that we won't be able to change once if this
gets into mainstream.
I don't think we're on the same page still.
To make this more productive, please say what feature you would
want to see supported and I can show how it is done without
changing ebpf insn set.
Post by Pablo Neira Ayuso
Probably the nft is not the easiest path, I agree, it's been not so
far if you look at the record. But with time and development hours
from everyone, I believe we'll enjoy a nice framework.
No doubt that nftables is a nice framework. Let's keep it going
and let's make it faster.
We want to make it faster, but I don't like the idea of converting
it to bpf just to get the jit fast.

We are spending quite a lot of time to provide good performance
through arraging data in performance data structures, the jit
microoptimization will be just the cherry on the top of the cake.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Alexei Starovoitov
2014-03-15 04:06:00 UTC
Permalink
Post by Pablo Neira Ayuso
Post by Alexei Starovoitov
Post by Pablo Neira Ayuso
[...]
It seems you're assuming that ebpf inherited all the shortcomings
of bpf and making conclusion based on that. Not your fault.
I didn't explain it well enough.
Frankly, from the *interface point of view* it is indeed inheriting
all of its shortcomings...
Hi Pablo, David,

Let's go back to what ebpf is...
ebpf == generalization of assembler instructions across different architectures.

Take x86_64 instructions ld/st/alu/mov/cmp/call/jmp and
then rename them into my_ld, my_st, my_add, my_call, etc
Then do the same for arm64.
Also rename register names into r0,r1,r2
and remember how you did the mapping.
Also analyze x86_64, arm64 call convention, so that callee saved
registers are mapped to the same regs and arguments are passed
in r1, r2, ...

A function call in such assembler will look like:
my_mov r1, 1
my_mov r2, 2
my_call foo

that maps back to x86_64:
mov rdi, 1
mov rsi, 2
call foo

Since calling convention is compatible between 'renamed assembler'
and original x86_64 or arm assembler, the program written in 'renamed
assembler' can call native functions directly.
The opposite is also true.
Functions written in x86 assembler or C can call into functions
written in 'renamed' assembler.
Example:

f1.s:
mov rdi, 1
mov rsi, 2
call f2
ret

f2.s:
my_mov r3, r1
my_mov r2, r1
my_mov r1, r3
my_call f3
my_ret

f3.s:
mov rax, rdi
sub rax, rsi
ret

fyi, in C these assembler blobs roughly do:
u64 f1() { return f2(1,2); }
u64 f2(u64 a, u64 b) { return f3(b, a); }
u64 f3(u64 a, u64 b) { return a - b; }

f1.s and f3.s are written in x86_64 and f2.s is written in 'renamed assembler'.

compile f1.s, f3.s into binary x86 code
compile f2.s into some binary code
(either fixed insn size or variable, that's irrelevant), let's call it format B

Now load all three binary blobs into kernel.
1st and 3rd blob can be loaded as-is.
2nd blob needs to be remapped from format B into x86_64 binary code.

After that CPU can call f1() and receive 1 back.

What programs can be written in x86_64 assembler? Anything.
What programs can be written in renamed assembler? Anything.

How often do we want to extend x86_64 assembler? Rarely.
Only when an algorithm implemented in pure x86_64 needs
mmx/ssa acceleration.
Intel does not extend x86 to add a feature, but to accelerate a feature.
Same with 'renamed' assembler.
Any algorithm can be implemented using renamed assembler.

So what is ebpf? It's a format B. It can be fixed size or variable.
That is irrelevant. While loading, the program in format B is
reverse mapped into x86 binary code.

What programs can be written in format B? Anything.
Does format B needs to be extended to support nft? no
to support socket filters? no
to support tracing filters? no
to support crazy idea that will come N years from now? no
Hard to believe? Think back that it is renamed x86 assembler.

Format B was chosen to look like bpf to make an adoption easier
and to make conversion from bpf to ebpf trivial,
but looks like it was a bad idea.
I should have just called it 'simplified x86_64 insn set'.

Now about 'user interface point of view'...
old bpf, netlink, nft format are interfaces.
Until format B is exposed to user space it is not an interface.
nftables can use format B to jit the code.
nftables's user interface doesn't change.

In the patches I sent, ebpf is _not_ exposed to the user.
My patch set immediately helps performance of existing
socket filters and seccomp.
And can do jitting for nft.

Another way of thinking about ebpf:
ebpf is a different way of encoding x86 insns.

I also think we can expose ebpf to the user space,
but that's a different patch and a different discussion.

Thanks!

Hi Pablo,
Post by Pablo Neira Ayuso
Post by Alexei Starovoitov
Technically ebpf is a small evolution of bpf, but applicability made a
giant leap. I cannot compile C into bpf, but I can do that with ebpf.
Good that we can get better userspace tools, but still the layout of
your instructions is exposed to userspace so it cannot be changed
*ever*. The kernel interface is still an array of binary structures of
fixed size just like BPF does.
that fixed size is irrelevant from extensibility point of view.
sparc has fixed size instructions too, but we don't change sparc
instruction width.
Let's say we decided to remap all sparc instructions and add new
pseudo instructions. These pseudo sparc insns won't buy us any
performance, because in the end they're remapped into real
instructions that cpu can execute.
These fake new pseudo sparc instructions won't give us
any new features either.

Format B should not be changed.
We can add new instructions if we really really need,
but there will not be a need to change existing insns.
Hard to believe? Think back that it is simplified x86.
We don't have a need to change existing x86 insns.

Example 1:
there are xadd_w and xadd_dw insns in format B to do
atomic increments. They don't have to be in the instruction set.
I've added them for performance and not because it cannot
be done without them.
atomic increments could have been done with function call.
ebpf call insn is #1 instruction that was missing in bpf.
It makes ebpf usable for any job.
ebpf program can always call a function.

Example 2:
In old bpf there are many extensions that fetch skb->protocol,
skb->len, skb->pkt_type and so on.
One extension per skb field. That was bad.
They were done as instruction extensions in old bpf,
because bpf didn't have a generic load operation.
ebpf doesn't need extensions for them. All these old bpf
extensions are converted to generic 'load' insn in ebpf.
and jited to x86 as single load, whereas old bpf jit needs
to have its own 'case' statement for every extension.

We can gradually replace old bpf jits with new ebpf jits.
and take time while doing this without exposing ebpf
to the userspace.

Example 3:
Old bpf_s_anc_nlattr extension cannot be jited with
current bpf jit, because it's too complicated
(requires thinking about calling convention, etc)
After conversion to ebpf it finally can be jited,
since it becomes a function call in x86.

That's the point. We do not need to change ebpf insn set.
Anything can be implemented as generic load/store operations
and function calls because ebpf == x86 assembler.
Post by Pablo Neira Ayuso
Why do we have to have a 32 bits immediate that may be zero most of
1) You needed to align your new instruction layout to 64 bits / two 32
bits words.
wrong guess. It's not alignment.
Format B could be anything.
I just picked it to be similar to old BPF to be easier to understand.
Apparently it's not that easy.
Post by Pablo Neira Ayuso
2) You noticed you get better performance with those changes,
including the new "if" statement that works better with branch
prediction.
nope. that's a side effect of 'simplified x86 assembler'
Neither x86 nor arm nor other CPUs have 'dual branch'
instructions. All CPUs either branch or fall through and
since ebpf is just a simplified x86 here you have such
style of branches.
Post by Pablo Neira Ayuso
3) It's easier for you to make the jit translation.
sorry, but 'easier' was not a factor.
ebpf instructions are 8-byte wide, just because old bpfs are
8-byte wide. I fitted all x86 instructions into 8 bytes.
Could have picked any other size.

Another way of thinking about ebpf:
ebpf is a different way of encoding x86 insns.

Whether instructions are variable length or fixed, it's
the same complexity to map back to x86.

ebpf interpreter is a different matter.
Interpreter obviously works better with fixed insn size.
ebpf interpreter you see only exists to support architectures
that don't have ebpf->native mapper yet.

If majority thinks that variable length insns will work better,
let's re-encode the whole thing into variable length.
I just don't see what it will buy us.
but I'm fine re-encoding ebpf into any other format.
Obviously 'simplified x86/arm insn set' can have any format,
as long as it is convenient to execute by interpreter,
not too complex for remapping into native
and has room to add instructions.
imo proposed ebpf format fits these three attributes just fine.
Post by Pablo Neira Ayuso
That means your interface is exposing *all of your internal
implementation decisions* and that's a very bad since we will always
have to come up with smart tricks not to break backward compatibility
if we want to improve the internal implementation.
We're not going to go back and break compatibility.
We're not breaking compatibility now either.
Did Intel change x86 encoding? no.
Same with ebpf. We don't need to change ebpf encoding.
Post by Pablo Neira Ayuso
Post by Alexei Starovoitov
I cannot do table lookups in bpf, but I can do that in ebpf.
I cannot rewrite packet headers in bpf, but I can do that in ebpf, etc.
Sorry, I don't buy this "we get more features" if in the long run we
have restricted extensibility.
That's not productive to keep saying 'restricted extensibility'
without providing a specific example.
Please come up with at least one case that ebpf cannot
handle as presented.
What instructions do you think are missing?
Post by Pablo Neira Ayuso
Post by Alexei Starovoitov
Here is the example from V2 series that shows how hash tables can be
void dropmon(struct kprobe_args *ctx)
{
void *loc;
uint64_t *drop_cnt;
* TRACE_EVENT(kfree_skb,
* TP_PROTO(struct sk_buff *skb, void *location),
* so ctx->arg2 is 'location'
*/
loc = (void *)ctx->arg2;
drop_cnt = bpf_table_lookup(ctx, 0, &loc);
Is there room to extend your framework with any other data structure
which is *not* a hashtable? What are you plans for that?
Please understand hashtable or xyztable is not an ebpf instruction.
It is a function call.
Generic call.
ebpf can call any function.
ebpf doesn't need to change a single bit to support other tables.
ebpf jits don't need to change either.
How table is implemented is out side of ebpf scope.
Type of keys, values are arbitrary.
bpf_table_lookup() is a C function inside kernel that ebpf
program calls.
Post by Pablo Neira Ayuso
The only different that I see with ebpf is that you provide nice end
user tools, but the design from the kernel interface has exactly the
same problems.
ok. what problems? Please be specific.
Post by Pablo Neira Ayuso
Post by Alexei Starovoitov
Post by Pablo Neira Ayuso
Right, you can extend interfaces forever with lots of patchwork and
"smart tricks" but that doesn't mean that will look nice...
I'm not sure what you mean here.
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index cf9cd13509a7..e1b979312588 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -273,6 +273,13 @@ static struct ctl_table net_core_table[] = {
},
#endif
{
+ .procname = "bpf_ext_enable",
+ .data = &bpf_ext_enable,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec
+ },
This /proc thing seems to me like the last resource we have to avoid
breaking backward compatibility. I have used it as well myself, but
it's like the *pull alarm* when we did a wrong design decision.
What if we need a new ABI breakage for BPF again? Will we need to add
a new /proc interface for that? As far as I can tell from your
patches, the answer is yes.
Where do you see ABI breakage? It's not broken.
I can remove bpf_ext_enable flag.
On or off it doesn't break any user interface.
Socket filters are still loading old bpf programs.
seccomp is still loading old bpf programs.
They get converted on the fly to new ebpf.
I added the flag only to be able to easily benchmark two interpreters.
We're planning to remove old bpf interpreter. It's obsolete.
Just like old sk_decode_filter().
This /proc flag doesn't need to be there. It can be removed.
Not a single user space app will notice the difference,
other than faster performance.
Post by Pablo Neira Ayuso
Post by Alexei Starovoitov
Post by Pablo Neira Ayuso
As I said, I believe that having a nice extensible interface is
extremely important to make it easier for development. If we have to
rearrange the internal representation for some reason, we can do it
indeed without bothering about making translations to avoid breaking
userspace and having to use ugly tricks (just see sk_decode_filter()
or any other translation to support any new way to express a
filter...).
nice that your brought this up :)
As I mentioned in v4 thread sk_decode_filter() can be removed.
It was introduced to improve old interpreter performance and now
this part is obsolete.
What are your plans then? Will you implement that converter in
userspace? You mention that you don't want to enhance libpcap, which
seems to me like the natural way to extend things.
Please see 1/3 patch. Converter from old bpf to ebpf takes 263 lines
of trivial remapping. It's that simple.
Post by Pablo Neira Ayuso
Post by Alexei Starovoitov
Post by Pablo Neira Ayuso
Post by Alexei Starovoitov
Say you want to translate nft-cmp instruction into sequence of native
comparisons. You'd need to use load from memory, compare and
branch operations. That's ebpf!
Nope sorry, that's not ebpf. That's assembler code.
Well, in my previous email I tried to explain that assembler == ebpf :)
I see, so I was right. You want to expose a pseudo-assembler
interface just because that makes it easier to you to provide the jit
translation.
If you're saying that ebpf == assembler, then yes, you're right.
Post by Pablo Neira Ayuso
Post by Alexei Starovoitov
Please post x86_64 assembler code that future nft-jit suppose to
generate and I can post equivalent ebpf code that will be jited
exactly to your x86_64...
That's possible of course. There are many ways to implement the same
thing, they can provide the same features, but not the same degree of
extensibility.
Totally agree. nft is definitely less extensible than ebpf.
You need to change nft for every new feature, whereas I don't need
to change ebpf. I don't need to change ebpf jits either.

Key point is: ebpf does not _need_ to be changed.
There are still plenty of reserved bits, so new instructions can be
added to improve performance, but so far I don't see a need.
We don't _have_ to add instructions. There is always
a way to do with what it has now.
It is a complete instruction set to support any integer program.
Yeah, floating point is not supported and will not be.

Thanks
Alexei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Pablo Neira Ayuso
2014-03-15 19:04:03 UTC
Permalink
On Fri, Mar 14, 2014 at 09:04:50PM -0700, Alexei Starovoitov wrote:
[...]
Post by Alexei Starovoitov
In the patches I sent, ebpf is _not_ exposed to the user.
From your last patch: http://patchwork.ozlabs.org/patch/329713/

diff --git a/include/uapi/linux/filter.h b/include/uapi/linux/filter.h
index 8eb9ccaa5b48..4e98fe16ba88 100644
--- a/include/uapi/linux/filter.h
+++ b/include/uapi/linux/filter.h
@@ -1,5 +1,6 @@
/*
* Linux Socket Filter Data Structures
+ * Extended BPF is Copyright (c) 2011-2014, PLUMgrid, http://plumgrid.com
*/

#ifndef _UAPI__LINUX_FILTER_H__
@@ -19,7 +20,7 @@
* Try and keep these values and structures similar to BSD,
* especially
* the BPF code definitions which need to match so you can share
* filters
*/
-
+
struct sock_filter { /* Filter block */
__u16 code; /* Actual filter code */
__u8 jt; /* Jump true */
@@ -27,6 +28,14 @@ struct sock_filter { /* Filter block */
__u32 k; /* Generic multiuse field */
};

+struct sock_filter_ext {
+ __u8 code; /* opcode */
+ __u8 a_reg:4; /* dest register */
+ __u8 x_reg:4; /* source register */
+ __s16 off; /* signed offset */
+ __s32 imm; /* signed immediate constant */
+};
+
struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
unsigned short len; /* Number of filter blocks */
struct sock_filter __user *filter;

That sock_filter_ext structure is exposed to userspace as well as many
other new BPF_* macros that you have defined.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Alexei Starovoitov
2014-03-15 19:19:28 UTC
Permalink
Post by Pablo Neira Ayuso
[...]
Post by Alexei Starovoitov
In the patches I sent, ebpf is _not_ exposed to the user.
From your last patch: http://patchwork.ozlabs.org/patch/329713/
diff --git a/include/uapi/linux/filter.h b/include/uapi/linux/filter.h
index 8eb9ccaa5b48..4e98fe16ba88 100644
--- a/include/uapi/linux/filter.h
+++ b/include/uapi/linux/filter.h
@@ -1,5 +1,6 @@
/*
* Linux Socket Filter Data Structures
+ * Extended BPF is Copyright (c) 2011-2014, PLUMgrid, http://plumgrid.com
*/
#ifndef _UAPI__LINUX_FILTER_H__
@@ -19,7 +20,7 @@
* Try and keep these values and structures similar to BSD,
* especially
* the BPF code definitions which need to match so you can share
* filters
*/
-
+
struct sock_filter { /* Filter block */
__u16 code; /* Actual filter code */
__u8 jt; /* Jump true */
@@ -27,6 +28,14 @@ struct sock_filter { /* Filter block */
__u32 k; /* Generic multiuse field */
};
+struct sock_filter_ext {
+ __u8 code; /* opcode */
+ __u8 a_reg:4; /* dest register */
+ __u8 x_reg:4; /* source register */
+ __s16 off; /* signed offset */
+ __s32 imm; /* signed immediate constant */
+};
+
struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
unsigned short len; /* Number of filter blocks */
struct sock_filter __user *filter;
That sock_filter_ext structure is exposed to userspace as well as many
other new BPF_* macros that you have defined.
For the first few versions of the patchset they were in linux/bpf.h,
but then it was suggested to put them into uapi/linux/filter.h
to make the whole thing consistent with existing sock_filter structure.
So yes, uapi header is changed as:
include/uapi/linux/filter.h | 33 ++++++++++++++++++++++++++-------
1 file changed, 26 insertions(+), 7 deletions(-)

but there is no way to use these #define from user space at present.

As I said I think it's safe to expose it, because these defines won't change,
but if there is a concern I can move it back into linux/bpf.h

Thanks
Alexei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Loading...