|author||Will Drewry <email@example.com>||2012-04-12 16:48:04 -0500|
|committer||James Morris <firstname.lastname@example.org>||2012-04-14 11:13:22 +1000|
Documents how system call filtering using Berkeley Packet Filter programs works and how it may be used. Includes an example for x86 and a semi-generic example using a macro-based code generator. Acked-by: Eric Paris <email@example.com> Signed-off-by: Will Drewry <firstname.lastname@example.org> Acked-by: Kees Cook <email@example.com> v18: - added acked by - update no new privs numbers v17: - remove @compat note and add Pitfalls section for arch checking (firstname.lastname@example.org) v16: - v15: - v14: - rebase/nochanges v13: - rebase on to 88ebdda6159ffc15699f204c33feb3e431bf9bdc v12: - comment on the ptrace_event use - update arch support comment - note the behavior of SECCOMP_RET_DATA when there are multiple filters (email@example.com) - lots of samples/ clean up incl 64-bit bpf-direct support (firstname.lastname@example.org) - rebase to linux-next v11: - overhaul return value language, updates (email@example.com) - comment on do_exit(SIGSYS) v10: - update for SIGSYS - update for new seccomp_data layout - update for ptrace option use v9: - updated bpf-direct.c for SIGILL v8: - add PR_SET_NO_NEW_PRIVS to the samples. v7: - updated for all the new stuff in v7: TRAP, TRACE - only talk about PR_SET_SECCOMP now - fixed bad JLE32 check (firstname.lastname@example.org) - adds dropper.c: a simple system call disabler v6: - tweak the language to note the requirement of PR_SET_NO_NEW_PRIVS being called prior to use. (email@example.com) v5: - update sample to use system call arguments - adds a "fancy" example using a macro-based generator - cleaned up bpf in the sample - update docs to mention arguments - fix prctl value (firstname.lastname@example.org) - language cleanup (email@example.com) v4: - update for no_new_privs use - minor tweaks v3: - call out BPF <-> Berkeley Packet Filter (firstname.lastname@example.org) - document use of tentative always-unprivileged - guard sample compilation for i386 and x86_64 v2: - move code to samples (email@example.com) Signed-off-by: James Morris <firstname.lastname@example.org>
Diffstat (limited to 'Documentation/prctl')
1 files changed, 163 insertions, 0 deletions
diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
@@ -0,0 +1,163 @@
+ SECure COMPuting with filters
+A large number of system calls are exposed to every userland process
+with many of them going unused for the entire lifetime of the process.
+As system calls change and mature, bugs are found and eradicated. A
+certain subset of userland applications benefit by having a reduced set
+of available system calls. The resulting set reduces the total kernel
+surface exposed to the application. System call filtering is meant for
+use with those applications.
+Seccomp filtering provides a means for a process to specify a filter for
+incoming system calls. The filter is expressed as a Berkeley Packet
+Filter (BPF) program, as with socket filters, except that the data
+operated on is related to the system call being made: system call
+number and the system call arguments. This allows for expressive
+filtering of system calls using a filter program language with a long
+history of being exposed to userland and a straightforward data set.
+Additionally, BPF makes it impossible for users of seccomp to fall prey
+to time-of-check-time-of-use (TOCTOU) attacks that are common in system
+call interposition frameworks. BPF programs may not dereference
+pointers which constrains all filters to solely evaluating the system
+call arguments directly.
+What it isn't
+System call filtering isn't a sandbox. It provides a clearly defined
+mechanism for minimizing the exposed kernel surface. It is meant to be
+a tool for sandbox developers to use. Beyond that, policy for logical
+behavior and information flow should be managed with a combination of
+other system hardening techniques and, potentially, an LSM of your
+choosing. Expressive, dynamic filters provide further options down this
+path (avoiding pathological sizes or selecting which of the multiplexed
+system calls in socketcall() is allowed, for instance) which could be
+construed, incorrectly, as a more complete sandboxing solution.
+An additional seccomp mode is added and is enabled using the same
+prctl(2) call as the strict seccomp. If the architecture has
+CONFIG_HAVE_ARCH_SECCOMP_FILTER, then filters may be added as below:
+ Now takes an additional argument which specifies a new filter
+ using a BPF program.
+ The BPF program will be executed over struct seccomp_data
+ reflecting the system call number, arguments, and other
+ metadata. The BPF program must then return one of the
+ acceptable values to inform the kernel which action should be
+ prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);
+ The 'prog' argument is a pointer to a struct sock_fprog which
+ will contain the filter program. If the program is invalid, the
+ call will return -1 and set errno to EINVAL.
+ If fork/clone and execve are allowed by @prog, any child
+ processes will be constrained to the same filters and system
+ call ABI as the parent.
+ Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or
+ run with CAP_SYS_ADMIN privileges in its namespace. If these are not
+ true, -EACCES will be returned. This requirement ensures that filter
+ programs cannot be applied to child processes with greater privileges
+ than the task that installed them.
+ Additionally, if prctl(2) is allowed by the attached filter,
+ additional filters may be layered on which will increase evaluation
+ time, but allow for further decreasing the attack surface during
+ execution of a process.
+The above call returns 0 on success and non-zero on error.
+A seccomp filter may return any of the following values. If multiple
+filters exist, the return value for the evaluation of a given system
+call will always use the highest precedent value. (For example,
+SECCOMP_RET_KILL will always take precedence.)
+In precedence order, they are:
+ Results in the task exiting immediately without executing the
+ system call. The exit status of the task (status & 0x7f) will
+ be SIGSYS, not SIGKILL.
+ Results in the kernel sending a SIGSYS signal to the triggering
+ task without executing the system call. The kernel will
+ rollback the register state to just before the system call
+ entry such that a signal handler in the task will be able to
+ inspect the ucontext_t->uc_mcontext registers and emulate
+ system call success or failure upon return from the signal
+ The SECCOMP_RET_DATA portion of the return value will be passed
+ as si_errno.
+ SIGSYS triggered by seccomp will have a si_code of SYS_SECCOMP.
+ Results in the lower 16-bits of the return value being passed
+ to userland as the errno without executing the system call.
+ When returned, this value will cause the kernel to attempt to
+ notify a ptrace()-based tracer prior to executing the system
+ call. If there is no tracer present, -ENOSYS is returned to
+ userland and the system call is not executed.
+ A tracer will be notified if it requests PTRACE_O_TRACESECCOMP
+ using ptrace(PTRACE_SETOPTIONS). The tracer will be notified
+ of a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion of
+ the BPF program return value will be available to the tracer
+ via PTRACE_GETEVENTMSG.
+ Results in the system call being executed.
+If multiple filters exist, the return value for the evaluation of a
+given system call will always use the highest precedent value.
+Precedence is only determined using the SECCOMP_RET_ACTION mask. When
+multiple filters return values of the same precedence, only the
+SECCOMP_RET_DATA from the most recently installed filter will be
+The biggest pitfall to avoid during use is filtering on system call
+number without checking the architecture value. Why? On any
+architecture that supports multiple system call invocation conventions,
+the system call numbers may vary based on the specific invocation. If
+the numbers in the different calling conventions overlap, then checks in
+the filters may be abused. Always check the arch value!
+The samples/seccomp/ directory contains both an x86-specific example
+and a more generic example of a higher level macro interface for BPF
+Adding architecture support
+See arch/Kconfig for the authoritative requirements. In general, if an
+architecture supports both ptrace_event and seccomp, it will be able to
+support seccomp filter with minor fixup: SIGSYS support and seccomp return
+value checking. Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER
+to its arch-specific Kconfig.