Alexei Starovoitov
2014-03-11 17:59:51 UTC
Hi!
The following patchset provides a socket filtering alternative to BPF
which allows you to define your filter using the nf_tables expressions.
Similarly to BPF, you can attach filters via setsockopt()
SO_ATTACH_NFT_FILTER. The filter that is passed to the kernel has the
following layout:
 expression list (nested attribute)
  expression element (nested attribute)
   expression name (string)
   expression data (nested attribute)
    ... specific attributes for this expression go here
This is similar to the netlink format of the nf_tables rules, so we
can re-use most of the infrastructure that we already have in userspace.
The kernel takes the TLV representation and translates it to the native
nf_tables representation.
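To make the nesting concrete, here is a minimal user-space sketch of how such a TLV list could be laid out in a buffer. It is not the patchset's code and does not use libnftnl: the attribute type numbers, the demo_* helpers and the buffer size are all hypothetical, and only the general netlink convention (16-bit length, 16-bit type, payload padded to 4 bytes) is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute type numbers, for illustration only. */
enum {
    DEMO_ATTR_EXPR_LIST = 1,
    DEMO_ATTR_EXPR_ELEM = 2,
    DEMO_ATTR_EXPR_NAME = 3,
    DEMO_ATTR_EXPR_DATA = 4,
};

#define DEMO_ALIGN(len) (((len) + 3) & ~(size_t)3) /* pad to 4 bytes */
#define DEMO_HDRLEN 4u                             /* u16 len + u16 type */

struct demo_buf {
    uint8_t data[256];
    size_t  len;
};

/* Append an attribute header and return its offset, so that a nest
 * can patch the length in once its children have been written. */
static size_t demo_put_hdr(struct demo_buf *b, uint16_t type, uint16_t len)
{
    size_t off = b->len;

    assert(off + DEMO_HDRLEN <= sizeof(b->data));
    memcpy(b->data + off, &len, 2);
    memcpy(b->data + off + 2, &type, 2);
    b->len += DEMO_HDRLEN;
    return off;
}

/* Append a leaf attribute: header, payload, zero padding. */
static void demo_put(struct demo_buf *b, uint16_t type, const void *p, size_t n)
{
    assert(b->len + DEMO_HDRLEN + DEMO_ALIGN(n) <= sizeof(b->data));
    demo_put_hdr(b, type, (uint16_t)(DEMO_HDRLEN + n));
    memcpy(b->data + b->len, p, n);
    memset(b->data + b->len + n, 0, DEMO_ALIGN(n) - n);
    b->len += DEMO_ALIGN(n);
}

static size_t demo_nest_start(struct demo_buf *b, uint16_t type)
{
    return demo_put_hdr(b, type, 0);
}

/* Close a nest: its length covers the header plus everything after it. */
static void demo_nest_end(struct demo_buf *b, size_t off)
{
    uint16_t len = (uint16_t)(b->len - off);

    memcpy(b->data + off, &len, 2);
}

/* Build: list { element { name = "cmp", data { } } } */
static size_t demo_build_filter(struct demo_buf *b)
{
    size_t list, elem, data;

    b->len = 0;
    list = demo_nest_start(b, DEMO_ATTR_EXPR_LIST);
    elem = demo_nest_start(b, DEMO_ATTR_EXPR_ELEM);
    demo_put(b, DEMO_ATTR_EXPR_NAME, "cmp", sizeof("cmp"));
    data = demo_nest_start(b, DEMO_ATTR_EXPR_DATA);
    /* expression-specific attributes would be appended here */
    demo_nest_end(b, data);
    demo_nest_end(b, elem);
    demo_nest_end(b, list);
    return b->len;
}
```

A real filter would append one element per expression inside the list nest; the kernel side would then walk the same structure in the opposite direction.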
Patches 1-3 generalize the existing socket filtering
infrastructure to allow plugging in new socket filtering frameworks. Then,
patches 4-8 generalize the nf_tables code by moving the necessary nf_tables
expression and data initialization core infrastructure. Then, patch 9
provides the nf_tables socket filtering capabilities.
Patrick and I have been discussing for a while that part of this
generalisation work should also help to add support for providing a
replacement to the tc framework, so with the necessary work, nf_tables
may provide in the near future a single packet classification
framework for Linux.
I'm being curious here ;) as there's currently an ongoing effort on
netdev for Alexei's eBPF engine (part 1 at [1,2,3]), which addresses
shortcomings of current BPF and shall long term entirely replace the
current BPF engine code to let filters entirely run in eBPF resp.
eBPF's JIT engine, as I understand, which is also transparently usable
in cls_bpf for classification in tc w/o rewriting on a different filter
language. Performance figures have been posted/provided in [1] as well.
So the plan on your side would be to have an alternative to eBPF, or
build on top of it to reuse its in-kernel JIT compiler?
[1] http://patchwork.ozlabs.org/patch/328927/
[2] http://patchwork.ozlabs.org/patch/328926/
[3] http://patchwork.ozlabs.org/patch/328928/
http://people.netfilter.org/pablo/nft-sock-filter-test.c
I'm currently reusing the existing libnftnl interfaces; my plan is to add
new interfaces to that library for easier and simpler filter
definition for socket filtering.
Note that the current nf_tables expression-set is also limited compared
to BPF, but the infrastructure that we have can be easily
extended with new expressions.
Comments welcome!
Could you share what performance you're getting when doing nft
filter equivalent to 'tcpdump port 22' ?
Meaning your filter needs to parse eth->proto, ip or ipv6 header and
check both ports. How will it compare with JITed bpf/ebpf ?
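For reference, the work such a filter has to do per packet can be sketched in plain C as below. This is only an illustration of the parsing steps named above, not nft, BPF or tcpdump code: the demo_* names are hypothetical, only TCP and UDP are considered, VLAN tags are not handled, and IPv6 extension headers are not walked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ETH_HLEN    14u
#define DEMO_ETH_P_IP    0x0800u
#define DEMO_ETH_P_IPV6  0x86DDu
#define DEMO_PROTO_TCP   6u
#define DEMO_PROTO_UDP   17u

/* Read a big-endian (network order) 16-bit value. */
static uint16_t demo_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Return 1 if the Ethernet frame carries TCP or UDP over IPv4/IPv6 and
 * either the source or the destination port equals `port`, else 0. */
static int demo_match_port(const uint8_t *frame, size_t len, uint16_t port)
{
    size_t l4;       /* offset of the transport header */
    uint8_t proto;

    assert(frame != NULL);
    if (len < DEMO_ETH_HLEN)
        return 0;

    switch (demo_be16(frame + 12)) {     /* eth->proto */
    case DEMO_ETH_P_IP: {
        const uint8_t *ip = frame + DEMO_ETH_HLEN;
        size_t ihl;

        if (len < DEMO_ETH_HLEN + 20)
            return 0;
        ihl = (size_t)(ip[0] & 0x0f) * 4;
        if (ihl < 20)
            return 0;
        /* Non-first fragments do not carry the port fields. */
        if (demo_be16(ip + 6) & 0x1fff)
            return 0;
        proto = ip[9];
        l4 = DEMO_ETH_HLEN + ihl;
        break;
    }
    case DEMO_ETH_P_IPV6:
        if (len < DEMO_ETH_HLEN + 40)
            return 0;
        /* Assumption: the next header is the transport protocol. */
        proto = frame[DEMO_ETH_HLEN + 6];
        l4 = DEMO_ETH_HLEN + 40;
        break;
    default:
        return 0;
    }

    if (proto != DEMO_PROTO_TCP && proto != DEMO_PROTO_UDP)
        return 0;
    if (len < l4 + 4)
        return 0;
    return demo_be16(frame + l4) == port ||
           demo_be16(frame + l4 + 2) == port;
}
```

Calling demo_match_port(frame, len, 22) on each captured frame gives a baseline to compare an interpreted or JITed filter against.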
I was trying to go the other way: improve nft performance with ebpf.
10/40G links are way too fast for interpreters. imo JIT is the only way.
here are some comments about patches:
1/9:
- if (fp->bpf_func != sk_run_filter)
- module_free(NULL, fp->bpf_func);
+ if (fp->run_filter != sk_run_filter)
+ module_free(NULL, fp->run_filter);
David suggested that these comparisons in all jits are ugly.
I've fixed it in my patches. When they're in, you wouldn't need to
mess with this.
2/9:
- atomic_sub(sk_filter_size(fp->len), &sk->sk_omem_alloc);
+ atomic_sub(fp->size, &sk->sk_omem_alloc);
that's a big change in socket memory accounting.
We used to account for the whole sk_filter... now you're counting
filter size only.
Is it valid?
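To illustrate the difference in what gets charged, here is a small user-space model. The struct layouts are simplified stand-ins and not the kernel's definitions, and the demo_* names are hypothetical; the only point is that charging the program bytes alone leaves the filter header unaccounted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins; NOT the kernel's struct definitions. */
struct demo_insn {
    uint16_t code;
    uint8_t  jt, jf;
    uint32_t k;
};

struct demo_filter {
    int          refcnt;
    unsigned int len;            /* number of instructions */
    struct demo_insn insns[];
};

/* Old scheme: charge the header plus the instructions. */
static size_t demo_charge_whole(unsigned int len)
{
    return sizeof(struct demo_filter) + len * sizeof(struct demo_insn);
}

/* New scheme as read from the hunk: charge only the program size. */
static size_t demo_charge_size_only(unsigned int len)
{
    return len * sizeof(struct demo_insn);
}
```

For any program length, the two functions differ by exactly sizeof(struct demo_filter), which is the per-filter amount that would no longer count against the socket's option memory.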
7/9:
whole nft_expr_autoload() looks scary from security point of view.
If I'm reading it correctly, the code will do request_module() based on
userspace request to attach filter?
9/9:
+ case SO_NFT_GET_FILTER:
+ len = sk_nft_get_filter(sk, (struct sock_filter __user *)optval, len);
with my patches there was a concern regarding socket checkpoint/restore
and I had to preserve existing filter image to make sure it's not broken.
Could you please coordinate with Pavel and co to test this piece?
What will happen if an nft filter is attached, but so_get_filter is called? crash?
+static int nft_sock_expr_autoload(const struct nft_ctx *ctx,
+ const struct nlattr *nla)
+{
+#ifdef CONFIG_MODULES
+ mutex_unlock(&nft_expr_info_mutex);
+ request_module("nft-expr-%.*s", nla_len(nla), (char *)nla_data(nla));
+ mutex_lock(&nft_expr_info_mutex);
same security concern here...
+int sk_nft_attach_filter(char __user *optval, struct sock *sk)
+{
what about sk_clone_lock()? since filter program is in nft, do you need to do
special steps during copy of socket?
+ fp = sock_kmalloc(sk, sizeof(struct sk_filter) + size, GFP_KERNEL);
this may allocate more memory than you need.
Currently sk_filter_size() computes it in an accurate way.
Also the same issue of optmem accounting as I mentioned in 2/9
+err4:
+ sock_kfree_s(sk, fp, size);
a small bug: allocated sizeof(sk_filter)+size, but freeing 'size' only...
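The effect of that mismatch can be shown with a user-space model of the charge/uncharge pairing. The demo_* functions below are hypothetical stand-ins built on malloc(), not the kernel helpers; they only assume that allocation adds the size to a per-socket counter and that freeing subtracts whatever size the caller passes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* User-space stand-in for a socket's option-memory counter. */
struct demo_sock {
    long omem_alloc;
};

/* Allocate and charge `size` bytes to the socket. */
static void *demo_sock_kmalloc(struct demo_sock *sk, size_t size)
{
    void *p = malloc(size);

    if (p)
        sk->omem_alloc += (long)size;
    return p;
}

/* Free and uncharge; the caller must pass the size it allocated. */
static void demo_sock_kfree_s(struct demo_sock *sk, void *p, size_t size)
{
    free(p);
    sk->omem_alloc -= (long)size;
}

/* Allocate hdr + size, free with `free_size`, and return the charge
 * that is left on the socket afterwards. */
static long demo_leak_after_error(size_t hdr, size_t size, size_t free_size)
{
    struct demo_sock sk = { 0 };
    void *fp = demo_sock_kmalloc(&sk, hdr + size);

    assert(fp != NULL);
    demo_sock_kfree_s(&sk, fp, free_size);
    return sk.omem_alloc;
}
```

Freeing with `size` alone leaves `hdr` bytes charged on every pass through the error path; passing the same hdr + size that was allocated brings the counter back to zero.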
Overall I think it's very interesting work.
Not sure what's the use case for it though.
I'll cook up a patch for the opposite approach (use ebpf inside nft)
and will send you for review.
I would prefer to work together to satisfy your and our user requests.
Thanks
Alexei
net: rename fp->bpf_func to fp->run_filter
net: filter: account filter length in bytes
net: filter: generalise sk_filter_release
netfilter: nf_tables: move fast operations to header
netfilter: nf_tables: add nft_value_init
netfilter: nf_tables: rename nf_tables_core.c to nf_tables_nf.c
netfilter: nf_tables: move expression infrastructure to built-in core
netfilter: nf_tables: generalize verdict handling and introduce scopes
netfilter: nf_tables: add support for socket filtering
arch/arm/net/bpf_jit_32.c | 25 +-
arch/powerpc/net/bpf_jit_comp.c | 10 +-
arch/s390/net/bpf_jit_comp.c | 16 +-
arch/sparc/net/bpf_jit_comp.c | 8 +-
arch/x86/net/bpf_jit_comp.c | 8 +-
include/linux/filter.h | 28 +-
include/net/netfilter/nf_tables.h | 27 +-
include/net/netfilter/nf_tables_core.h | 84 +++++
include/net/netfilter/nft_reject.h | 3 +-
include/net/sock.h | 8 +-
include/uapi/asm-generic/socket.h | 4 +
net/core/filter.c | 28 +-
net/core/sock.c | 19 ++
net/core/sock_diag.c | 4 +-
net/netfilter/Kconfig | 13 +
net/netfilter/Makefile | 9 +-
net/netfilter/nf_tables_api.c | 440 ++++---------------------
net/netfilter/nf_tables_core.c | 564 +++++++++++++++++++++-----------
net/netfilter/nf_tables_nf.c | 189 +++++++++++
net/netfilter/nf_tables_sock.c | 327 ++++++++++++++++++
net/netfilter/nft_bitwise.c | 35 +-
net/netfilter/nft_byteorder.c | 28 +-
net/netfilter/nft_cmp.c | 43 ++-
net/netfilter/nft_compat.c | 6 +-
net/netfilter/nft_counter.c | 3 +-
net/netfilter/nft_ct.c | 9 +-
net/netfilter/nft_exthdr.c | 3 +-
net/netfilter/nft_hash.c | 12 +-
net/netfilter/nft_immediate.c | 35 +-
net/netfilter/nft_limit.c | 3 +-
net/netfilter/nft_log.c | 3 +-
net/netfilter/nft_lookup.c | 3 +-
net/netfilter/nft_meta.c | 51 ++-
net/netfilter/nft_nat.c | 3 +-
net/netfilter/nft_payload.c | 29 +-
net/netfilter/nft_queue.c | 3 +-
net/netfilter/nft_rbtree.c | 12 +-
net/netfilter/nft_reject.c | 3 +-
38 files changed, 1416 insertions(+), 682 deletions(-)
create mode 100644 net/netfilter/nf_tables_nf.c
create mode 100644 net/netfilter/nf_tables_sock.c