Vincent Tech Blog: August 2009

Friday, August 14, 2009

My patch to pagemap clear_refs

I posted one trivial patch to fix the user input and got accepted in Andrew Morton's mm test tree, Oh yeah, This is my first kernel patch and it got accepted, that is encouraging and making me feel contributing more :-)

http://marc.info/?t=125011994400002&r=1&w=2


    fs/proc/task_mmu.c v1: fix clear_refs_write() input sanity check

    v1 fix the compiling errors and keep the type variable name.

    Andrew Morton pointed out similar string hacking and obfuscated check for zero-length input
    at the end of the function, David Rientjes suggested to use strict_strtol to replace
    simple_strtol, this patch cover above suggestions, add removing of leading and trailing
    whitespace from user input. It does not change function behavious.

    This patch is rebased on mmotm-2009-08-04-14-22.

    Signed-off-by: Vincent Li 

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f884ad4..2a1bef9 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -492,21 +492,20 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
                                size_t count, loff_t *ppos)
 {
        struct task_struct *task;
-       char buffer[PROC_NUMBUF], *end;
+       char buffer[PROC_NUMBUF];
        struct mm_struct *mm;
        struct vm_area_struct *vma;
-       int type;
+       long type;

        memset(buffer, 0, sizeof(buffer));
        if (count > sizeof(buffer) - 1)
                count = sizeof(buffer) - 1;
        if (copy_from_user(buffer, buf, count))
                return -EFAULT;
-       type = simple_strtol(buffer, &end, 0);
+       if (strict_strtol(strstrip(buffer), 10, &type))
+               return -EINVAL;
        if (type < CLEAR_REFS_ALL || type > CLEAR_REFS_MAPPED)
                return -EINVAL;
-       if (*end == '\n')
-               end++;
        task = get_proc_task(file->f_path.dentry->d_inode);
        if (!task)
                return -ESRCH;
@@ -542,9 +541,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
                mmput(mm);
        }
        put_task_struct(task);
-       if (end - buffer == 0)
-               return -EIO;
-       return end - buffer;
+
+       return count;
 }

 const struct file_operations proc_clear_refs_operations = {

pagemap clear_refs: specify anon or mapped pages to be cleared

Moussa A. Ba posted patch to specify anon or mapped page reference to be cleared

http://marc.info/?l=linux-kernel&m=124882208428609&w=2

and the script to test out this patch:

http://marc.info/?l=linux-kernel&m=124898170201564&w=2

Monday, August 10, 2009

Add some trace events for the page allocator

Mel Gorman has posted series of patches to add trace events for page allocator:

http://marc.info/?l=linux-kernel&m=124991900725530&w=2

Here is How I tested out those patches:

1, make menuconfig as below:

kernel hacking -> Tracers

.config - Linux Kernel v2.6.31-rc5 Configuration
─────────────────────────────────────────────────────────────────────────────────────────────────────
┌─────────────────────────────────────────── Tracers ────────────────────────────────────────────┐
│ Arrow keys navigate the menu. selects submenus --->. Highlighted letters are │
│ hotkeys. Pressing includes, excludes, modularizes features. Press to │
│ exit, for Help, for Search. Legend: [*] built-in [ ] excluded module < > │
│ module capable │
│ ┌────────────────────────────────────────────────────────────────────────────────────────────┐ │
│ │ --- Tracers │ │
│ │ [*] Kernel Function Tracer │ │
│ │ [*] Kernel Function Graph Tracer │ │
│ │ [ ] Interrupts-off Latency Tracer │ │
│ │ [ ] Sysprof Tracer │ │
│ │ [ ] Scheduling Latency Tracer │ │
│ │ [*] Trace syscalls │ │
│ │ [ ] Trace boot initcalls │ │
│ │ Branch Profiling (No branch profiling) ---> │ │
│ │ [ ] Trace power consumption behavior │ │
│ │ [ ] Trace max stack │ │
│ │ [*] Trace SLAB allocations │ │
│ │ [ ] Trace workqueues │ │
│ │ [*] Support for tracing block io actions │ │
│ │ [*] enable/disable ftrace tracepoints dynamically │ │
│ │ [*] Kernel function profiler │ │
│ │ [ ] Perform a startup test on ftrace │ │
│ │ [*] Memory mapped IO tracing │ │
│ │ < > Test module for mmiotrace │ │
│ │ < > Ring buffer benchmark stress tester │ │
│ │ │ │
│ └────────────────────────────────────────────────────────────────────────────────────────────┘ │

2, read Documentation/trace/ftrace.txt and add line below in /etc/fstab:

debugfs /sys/kernel/debug debugfs defaults 0 0

3, apply the series of patches in order as git am series of patches

4, enable page allocator tracing events with:

#for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done

5, run post-process script to get the page allocator events tracing data:

trace-pagealloc-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe

We could use perf user tools under tools/perf/ to enable the page allocator trace events and track events data. read more in Documentation/trace/tracepoint-analysis.txt

Wednesday, August 5, 2009

Move oom_adj to signal_struct discussion

I copy and paste the content here for future reference:

Full discussion thread is here:
http://marc.info/?l=linux-mm&m=124938152105260&w=2

http://marc.info/?l=linux-mm&m=124938156605311&w=2


Minchan Kim  wrote:
> Hmm. I can't understand why it is troublesome.
> I think it's related to moving oom_adj to singal_struct.
> Unfortunately, I can't understand why we have to put oom_adj
> in singal_struct?
>
> That's why I have a question to Kosaki a while ago.

> I can't understand it still. :-(
>
> Could you elaborate it ?
>

Current code is as following
==
  do_each_thread(g,p) {
        ......
        p = badness();

        record p of highest badness.
  }
  p = higest badness thread.

  Scan all threads which shares mm_struct of p. and check oom_adj

==
Assume a process which has 20000 threads. And 1 of thread has OOM_DISABLE.

Then, at worst, this scan will needs
        (1+2+3+....+20000) * (20000-1) scan. (when ignoring other processes)
even with your patch.

This means the kernel wastes enough long time that Cluster-Management-Software can
detetct this as livelock, and do reboot/cluster-fail-over.

Fixing livelock is not the last goal. I (we) would like to reduct stall time
to reasonable level. If we move oom_adj to signal_struct or mm_struct, scan-cost
will be only 20000. No retry at all.

And, if we can use for_each_process() rather than do_each_thread(),
scan-cost will be 1.

(BTW, "signal" struct is bad name I think, it should be "process" struct ;)


Thanks,
-Kame



> > > > > > > > Hi, Kosaki.
> > > > > > > >
> > > > > > > > I am so late to invole this thread.
> > > > > > > > But let me have a question.
> > > > > > > >
> > > > > > > > What's advantage of placing oom_adj in singal rather than task ?
> > > > > > > > I mean task->oom_adj and task->signal->oom_adj ?
> > > > > > > >
> > > > > > > > I am sorry if you already discussed it at last threads.
> > > > > > >
> > > > > > > Not sorry. that's very good question.
> > > > > > >
> > > > > > > I'm trying to explain the detailed intention of commit 2ff05b2b4eac
> > > > > > > (move oom_adj to mm_struct).
> > > > > > >
> > > > > > > In 2.6.30, OOM logic callflow is here.
> > > > > > >
> > > > > > > __out_of_memory
> > > > > > >   select_bad_process              for each task
> > > > > > >           badness                 calculate badness of one task
> > > > > > >   oom_kill_process                search child
> > > > > > >           oom_kill_task           kill target task and mm shared tasks with it
> > > > > > >
> > > > > > > example, process-A have two thread, thread-A and thread-B and it
> > > > > > > have very fat memory.
> > > > > > > And, each thread have following likes oom property.
> > > > > > >
> > > > > > >   thread-A: oom_adj = OOM_DISABLE, oom_score = 0
> > > > > > >   thread-B: oom_adj = 0,           oom_score = very-high
> > > > > > >
> > > > > > > Then, select_bad_process() select thread-B, but oom_kill_task refuse
> > > > > > > kill the task because thread-A have OOM_DISABLE.
> > > > > > > __out_of_memory() call select_bad_process() again. but select_bad_process()
> > > > > > > select the same task. It mean kernel fall in the livelock.
> > > > > > >
> > > > > > > The fact is, select_bad_process() must select killable task. otherwise
> > > > > > > OOM logic go into livelock.
> > > > > > >
> > > > > > > Is this enough explanation? thanks.
> > > > > > >
> > > >
> > > > The problem resulted from David patch.
> > > > It can solve live lock problem but make a new problem like vfork problem.
> > > > I think both can be solved by different approach.
> > > >
> > > > It's just RFC.
> > > >
> > > > If some process is selected by OOM killer but it have a child of OOM immune,
> > > > We just decrease point of process. It can affect selection of bad process.
> > > > After some trial, at last bad score is drastically low and another process is
> > > > selected by OOM killer. So I think Live lock don't happen.
> > > >
> > > > New variable adding in task struct is rather high cost.
> > > > But i think we can union it with oomkilladj
> > > > since oomkilladj is used to present just -17 ~ 15.
> > > >
> > > > What do you think about this approach ?
> > > >
> > > keeping this in "task" struct is troublesome.
> > > It may not livelock but near-to-livelock state, in bad case.
> >
> > Hmm. I can't understand why it is troublesome.
> > I think it's related to moving oom_adj to singal_struct.
> > Unfortunately, I can't understand why we have to put oom_adj
> > in singal_struct?
> >
> > That's why I have a question to Kosaki a while ago.
> > I can't understand it still. :-(
> >
> > Could you elaborate it ?
>
> Maybe, It's because my explanation is still poor. sorry.
> Please give me one more chance.
>
> In my previous mail, I explained select_bad_process() must not
> unkillable task, is this ok?
> IOW, if all thread have the same oom_adj, the issue gone.
>
> signal_struct is shared all thread in the process. then, the issue gone.
>
>

Your and Kame's good explanation opens my eyes. :)
I realized your approach's benefit.

Yes. Let's wait to listen others's opinios.

Vincent Tech Blog