Friday, July 24, 2009

gfp_zone analysis

I have three zones on my system: DMA, NORMAL and HIGHMEM. Let's figure out how gfp_zone works.

Assume the allocation flag value is 0x421, which translates to:

__GFP_DMA | __GFP_HIGH | __GFP_REPEAT

which means the memory should come from ZONE_DMA. Will gfp_zone be able to derive ZONE_DMA from these gfp flags?

Let's continue:

#define GFP_ZONEMASK (__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE)

Given the definition above, GFP_ZONEMASK is:

GFP_ZONEMASK = 0x0f


We have at most four zones (DMA, NORMAL, HIGHMEM plus ZONE_MOVABLE), which fit into two bits, so

ZONES_SHIFT = 0x02

GFP_ZONE_TABLE packs, for each combination of zone modifier bits, the zone to allocate from, using ZONES_SHIFT bits per entry. The entries that matter for this example are: entry 0 (no zone flag) = ZONE_NORMAL = 1, entry __GFP_DMA (0x01) = ZONE_DMA = 0, and entry __GFP_HIGHMEM (0x02) = ZONE_HIGHMEM = 2. So the low bits of the table are:

(2 << (0x02 * ZONES_SHIFT)) | 1 = 0x21 = 33 = binary 100001

bit = (__GFP_DMA | __GFP_HIGH | __GFP_REPEAT) & 0x0f = 0x01


static inline enum zone_type gfp_zone(gfp_t flags)
{
        enum zone_type z;
        int bit = flags & GFP_ZONEMASK;

        z = (GFP_ZONE_TABLE >> (bit * ZONES_SHIFT)) &
                                         ((1 << ZONES_SHIFT) - 1);

        if (__builtin_constant_p(bit))
                BUILD_BUG_ON((GFP_ZONE_BAD >> bit) & 1);
        else {
#ifdef CONFIG_DEBUG_VM
                BUG_ON((GFP_ZONE_BAD >> bit) & 1);
#endif
        }
        return z;
}


z = (GFP_ZONE_TABLE >> (bit * ZONES_SHIFT)) & ((1 << ZONES_SHIFT) - 1)
  = (100001 >> (0x01 * 0x02)) & ((1 << 0x02) - 1)
  = 001000 & 000011
  = 0
  = ZONE_DMA
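
So yes, gfp_zone maps 0x421 to ZONE_DMA. To double-check the arithmetic outside the kernel, here is a small userspace sketch that hard-codes the values assumed above (ZONES_SHIFT = 2 and only the table entries relevant to this example); it illustrates the lookup, it is not the kernel's full GFP_ZONE_TABLE:

#include <stdio.h>

/* Values assumed from the walkthrough above (32-bit x86, 2.6.31-era layout). */
enum zone_type { ZONE_DMA = 0, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE };

#define __GFP_DMA      0x01u
#define __GFP_HIGHMEM  0x02u
#define __GFP_DMA32    0x04u
#define __GFP_MOVABLE  0x08u
#define __GFP_HIGH     0x20u
#define __GFP_REPEAT   0x400u

#define GFP_ZONEMASK   (__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32 | __GFP_MOVABLE)
#define ZONES_SHIFT    2

/* Only the table entries used in this example: index 0 -> ZONE_NORMAL,
 * index __GFP_DMA -> ZONE_DMA (0), index __GFP_HIGHMEM -> ZONE_HIGHMEM. */
#define GFP_ZONE_TABLE \
        ((ZONE_NORMAL << 0 * ZONES_SHIFT) | \
         (ZONE_DMA << __GFP_DMA * ZONES_SHIFT) | \
         (ZONE_HIGHMEM << __GFP_HIGHMEM * ZONES_SHIFT))

static enum zone_type gfp_zone(unsigned int flags)
{
        int bit = flags & GFP_ZONEMASK;

        return (GFP_ZONE_TABLE >> (bit * ZONES_SHIFT)) &
                ((1 << ZONES_SHIFT) - 1);
}

int main(void)
{
        unsigned int flags = __GFP_DMA | __GFP_HIGH | __GFP_REPEAT; /* 0x421 */

        printf("gfp_zone(0x%x) = %d (0 == ZONE_DMA)\n", flags, gfp_zone(flags));
        return 0;
}

It should print gfp_zone(0x421) = 0, i.e. ZONE_DMA.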

Thursday, July 23, 2009

Chat about git: how to apply a local custom patch on top of the mainline master branch

(09:35:49) vincentinsz: hm, question about git, now my kernel git repo is 2.6.31-rc3, and i git branched a test branch and committed custom patch, so the test branch is 2.6.31-rc3 + custom patch
(09:36:47) vincentinsz: now the mainline kernel is 2.6.31-rc4, i git checkout master and git pull to sync to the mainline kernel
(09:37:26) vincentinsz: but how I sync my test branch so it would be 2.6.31-rc4 + custom branch?
(09:38:05) qunying: is your branch a direct checkout from the branch point or from the mainline
(09:43:23) vincentinsz: say linus is the public git repo, here is my working step: 1, git clone linus, 2, git pull (from time to time), 3, git branch test, 4, git commit custom patch, 5 git checkout master, 6 git pull ( new kernel tag released say -rc4), now how I let branch test sync to -rc4 + custom patch?
(09:44:44) qunying: normally you don't you git pull, as it will automatically merge
(09:45:12) qunying: use git fetch then git rebase origin
(09:45:48) qunying: will move your local commits to the top
(09:45:50) vincentinsz: but the merge only touch master branch, not test branch, I care about the custom patch in test branch, not master branch
(09:46:42) qunying: then in test branch, you rebase against master
(09:46:52) qunying: git rebase master
(09:47:45) vincentinsz: then the custom patch would be on top of master?
(09:47:57) qunying: ya
(09:48:45) qunying: it will bring you branch to the latest master + your own commit on top
(09:49:21) vincentinsz: ah, that is it
(09:52:17) vincentinsz: My thought is that I would never touch my local master branch except git pull to sync to linus public git repo, I only test custom patch on a local test branch and also would like to have the test branch sync to master with custom patch on top of it
(09:52:53) vincentinsz: so I would never ruin my local master branch
(09:53:46) vincentinsz: based on the idea that i would never run git clone again :-)
(09:53:54) vincentinsz: reasonable?
(09:55:13) qunying: ya
(09:57:49) vincentinsz: there is git stash, but it seems only save non-committed custom patch and reapply the patch on top
(09:58:04) qunying: ya
(10:06:24) vincentinsz: hm, interesting, it seems I can not save the gaim chat log any more to other text file
(10:06:34) vincentinsz: like copy and paste
(10:06:53) qunying: that is strange
(10:07:39) vincentinsz: I could highlight all the text, right click, there is copy option, but it wont be saved in memory
(10:08:14) vincentinsz: there is save as option in conversation menu, but it only save as html format, annoying
(10:08:40) qunying: ya
(10:09:04) vincentinsz: same to you?
(10:09:32) qunying: never try it, i always let gaim save its own
(10:09:58) qunying: mime is working fine
(10:10:18) vincentinsz: I would like to have the technical discussion posted on my personal blog, so I can always looked it up when I need it :)
(10:11:17) qunying: it works for me, probably i am using a newer version
(10:11:52) qunying: try just highlight the text, and use middle-key to paste on other program
(10:13:24) vincentinsz: I have no middle key on mouse, it is scroll key
(10:14:56) qunying: it is the same, press it like the other will do

Wednesday, July 15, 2009

include/linux/gfp.h


0x00u           0 -> (no zone flag: ZONE_NORMAL)
0x01u           1 -> __GFP_DMA
0x02u          10 -> __GFP_HIGHMEM
0x04u         100 -> __GFP_DMA32
0x08u        1000 -> __GFP_MOVABLE
0x0fu        1111 -> GFP_ZONEMASK
0x10u       10000 -> __GFP_WAIT
0x20u      100000 -> __GFP_HIGH
0x40u     1000000 -> __GFP_IO
0x80u    10000000 -> __GFP_FS
0x100u  100000000 -> __GFP_COLD

Use ./scripts/gfp-translate to translate the GFP flag hex value from a VM oops; for example, 0x4020 decodes to
__GFP_COMP | __GFP_HIGH
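
If the script is not at hand, a quick userspace sketch can do the same decoding for the flags listed above (values taken from the 2.6.31-era gfp.h; __GFP_COMP = 0x4000u is added here so the 0x4020 example works):

#include <stdio.h>

/* Flag values as listed above (2.6.31-era include/linux/gfp.h);
 * __GFP_COMP (0x4000u) is included so the 0x4020 example decodes. */
static const struct { unsigned int value; const char *name; } gfp_flags[] = {
        { 0x01u,   "__GFP_DMA" },
        { 0x02u,   "__GFP_HIGHMEM" },
        { 0x04u,   "__GFP_DMA32" },
        { 0x08u,   "__GFP_MOVABLE" },
        { 0x10u,   "__GFP_WAIT" },
        { 0x20u,   "__GFP_HIGH" },
        { 0x40u,   "__GFP_IO" },
        { 0x80u,   "__GFP_FS" },
        { 0x100u,  "__GFP_COLD" },
        { 0x4000u, "__GFP_COMP" },
};

static void gfp_translate(unsigned int mask)
{
        size_t i;

        printf("0x%x =", mask);
        for (i = 0; i < sizeof(gfp_flags) / sizeof(gfp_flags[0]); i++)
                if (mask & gfp_flags[i].value)
                        printf(" %s", gfp_flags[i].name);
        printf("\n");
}

int main(void)
{
        gfp_translate(0x4020);  /* prints: 0x4020 = __GFP_HIGH __GFP_COMP */
        return 0;
}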


Friday, July 10, 2009

The heart of the zoned buddy allocator

__alloc_pages_nodemask
 -> get_page_from_freelist (first attempt)
 -> __alloc_pages_slowpath (enter the slow-path allocation)
    -> wake_all_kswapd (wake up background page reclaim to free pages)
    -> get_page_from_freelist (try again; still no page? continue)
    -> __alloc_pages_high_priority (if ALLOC_NO_WATERMARKS, try this one)
    -> __alloc_pages_direct_reclaim (enter direct page reclaim)
    -> get_page_from_freelist (still no page, and direct reclaim made no progress?)
    -> __alloc_pages_may_oom (enter the OOM killer to kill some task and free pages)




Wednesday, July 8, 2009

Analysis of the shrink_slab function in mm/vmscan.c

The code snippet below is from 2.6.31-rc2.

184 #define SHRINK_BATCH 128
185 /*
186  * Call the shrink functions to age shrinkable caches
187  *
188  * Here we assume it costs one seek to replace a lru page and that it also
189  * takes a seek to recreate a cache object. With this in mind we age equal
190  * percentages of the lru and ageable caches. This should balance the seeks
191  * generated by these structures.
192  *
193  * If the vm encountered mapped pages on the LRU it increase the pressure on
194  * slab to avoid swapping.
195  *
196  * We do weird things to avoid (scanned*seeks*entries) overflowing 32 bits.
197  *
198  * `lru_pages' represents the number of on-LRU pages in all the zones which
199  * are eligible for the caller's allocation attempt. It is used for balancing
200  * slab reclaim versus page reclaim.
201  *
202  * Returns the number of slab objects which we shrunk.
203  */
204 unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
205                           unsigned long lru_pages)
206 {
207         struct shrinker *shrinker;
208         unsigned long ret = 0;
209 
210         if (scanned == 0)
211                 scanned = SWAP_CLUSTER_MAX;
212 
213         if (!down_read_trylock(&shrinker_rwsem))
214                 return 1;       /* Assume we'll be able to shrink next time */
215 
216         list_for_each_entry(shrinker, &shrinker_list, list) {
217                 unsigned long long delta;
218                 unsigned long total_scan;
219                 unsigned long max_pass = (*shrinker->shrink)(0, gfp_mask);
220 
221                 delta = (4 * scanned) / shrinker->seeks;
222                 delta *= max_pass;
223                 do_div(delta, lru_pages + 1);
224                 shrinker->nr += delta;
225                 if (shrinker->nr < 0) {
226                         printk(KERN_ERR "shrink_slab: %pF negative objects to "
227                                "delete nr=%ld\n",
228                                shrinker->shrink, shrinker->nr);
229                         shrinker->nr = max_pass;
230                 }
231 
232                 /*
233                  * Avoid risking looping forever due to too large nr value:
234                  * never try to free more than twice the estimate number of
235                  * freeable entries.
236                  */
237                 if (shrinker->nr > max_pass * 2)
238                         shrinker->nr = max_pass * 2;
239 
240                 total_scan = shrinker->nr;
241                 shrinker->nr = 0;
242 
243                 while (total_scan >= SHRINK_BATCH) {
244                         long this_scan = SHRINK_BATCH;
245                         int shrink_ret;
246                         int nr_before;
247 
248                         nr_before = (*shrinker->shrink)(0, gfp_mask);
249                         shrink_ret = (*shrinker->shrink)(this_scan, gfp_mask);
250                         if (shrink_ret == -1)
251                                 break;
252                         if (shrink_ret < nr_before)
253                                 ret += nr_before - shrink_ret;
254                         count_vm_events(SLABS_SCANNED, this_scan);
255                         total_scan -= this_scan;
256 
257                         cond_resched();
258                 }
259 
260                 shrinker->nr += total_scan;
261         }
262         up_read(&shrinker_rwsem);
263         return ret;
264 }



Line 204: shrink_slab gets called from multiple places; with cscope (Ctrl+\ then c) we get:


1   61 fs/drop_caches.c <<drop_slab>>
             nr_objects = shrink_slab(1000, GFP_KERNEL, 1000);
2 1697 mm/vmscan.c <<do_try_to_free_pages>>
             shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages);
3 1937 mm/vmscan.c <<balance_pgdat>>
             nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
4 2193 mm/vmscan.c <<shrink_all_memory>>
             shrink_slab(nr_pages, sc.gfp_mask, lru_pages);
5 2229 mm/vmscan.c <<shrink_all_memory>>
             shrink_slab(sc.nr_scanned, sc.gfp_mask,
6 2247 mm/vmscan.c <<shrink_all_memory>>
             shrink_slab(nr_pages, sc.gfp_mask, global_lru_pages());
7 2454 mm/vmscan.c <<__zone_reclaim>>
             while (shrink_slab(sc.nr_scanned, gfp_mask, order) &&

Tracing back to the calling functions, we can see that the scanned parameter refers to the number of scanned LRU pages
(sc->nr_scanned), and lru_pages refers to the total number of LRU pages in the eligible zones.

Lines 216 - 261 loop through the shrinker list to shrink the slab caches.
Line 219 gets the maximum shrinkable cache size (max_pass); see the shrinker callback definition in include/linux/mm.h:

862 /*
863 * A callback you can register to apply pressure to ageable caches.
864 *
865 * 'shrink' is passed a count 'nr_to_scan' and a 'gfpmask'. It should
866 * look through the least-recently-used 'nr_to_scan' entries and
867 * attempt to free them up. It should return the number of objects
868 * which remain in the cache. If it returns -1, it means it cannot do
869 * any scanning at this time (eg. there is a risk of deadlock).
870 *
871 * The 'gfpmask' refers to the allocation we are currently trying to
872 * fulfil.
873 *
874 * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
875 * querying the cache size, so a fastpath for that case is appropriate.
876 */
877 struct shrinker {
878         int (*shrink)(int nr_to_scan, gfp_t gfp_mask);
879         int seeks;      /* seeks to recreate an obj */
880 
881         /* These are for internal use */
882         struct list_head list;
883         long nr;        /* objs pending delete */
884 };
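
For context on how this callback interface is used by a cache, here is a minimal sketch of registering a shrinker against this 2.6.31-era API; the cache name, its object counter and the __GFP_FS check are made up for illustration:

#include <linux/module.h>
#include <linux/mm.h>

/* Hypothetical count of objects currently in our cache. */
static int my_cache_objects;

/*
 * Called by shrink_slab(): with nr_to_scan == 0 it only reports the cache
 * size; otherwise it should free up to nr_to_scan of the least recently
 * used objects and report how many remain (or -1 if it cannot scan now).
 */
static int my_cache_shrink(int nr_to_scan, gfp_t gfp_mask)
{
        if (nr_to_scan) {
                if (!(gfp_mask & __GFP_FS))
                        return -1;      /* e.g. avoid deadlocking against fs code */
                /* ... walk our LRU and free up to nr_to_scan objects ... */
        }
        return my_cache_objects;
}

static struct shrinker my_cache_shrinker = {
        .shrink = my_cache_shrink,
        .seeks  = DEFAULT_SEEKS,        /* defined as 2 in include/linux/mm.h */
};

static int __init my_cache_init(void)
{
        register_shrinker(&my_cache_shrinker);
        return 0;
}

static void __exit my_cache_exit(void)
{
        unregister_shrinker(&my_cache_shrinker);
}

module_init(my_cache_init);
module_exit(my_cache_exit);
MODULE_LICENSE("GPL");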


Lines 221 - 224 compute the number of pending shrinkable objects (delta). This matches the code comment above about aging
"equal percentages of the lru and ageable caches"; a worked example of the arithmetic follows these notes.

Lines 243 - 258 scan in batches of SHRINK_BATCH and accumulate the number of shrunk objects into the ret variable.

Line 263 returns the number of shrunk slab cache objects.
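
To make that balancing concrete, here is a small userspace re-run of the arithmetic in lines 221 - 224 with made-up numbers (scanned, lru_pages and max_pass are assumptions; seeks is DEFAULT_SEEKS = 2):

#include <stdio.h>

int main(void)
{
        /* Made-up inputs: */
        unsigned long scanned   = 128;     /* LRU pages scanned this round        */
        unsigned long lru_pages = 100000;  /* on-LRU pages in the eligible zones  */
        unsigned long max_pass  = 50000;   /* objects reported by shrink(0, mask) */
        unsigned int  seeks     = 2;       /* DEFAULT_SEEKS                       */

        /* Lines 221 - 223: delta = (4 * scanned / seeks) * max_pass / (lru_pages + 1) */
        unsigned long long delta = (4ULL * scanned) / seeks;
        delta *= max_pass;
        delta /= lru_pages + 1;

        printf("scanned %.2f%% of the LRU -> queue %llu of %lu cache objects (%.2f%%)\n",
               100.0 * scanned / lru_pages, delta, max_pass,
               100.0 * delta / max_pass);
        return 0;
}

With these numbers it queues roughly twice the LRU scan percentage worth of cache objects, since the factor works out to 4 / seeks = 2.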

A sample git workflow to send/receive a patch by email

I googled and tried a couple of git workflows for sending/receiving trivial kernel patches. Here is my summary:



###################################################################
# References
http://linux.yyz.us/git-howto.html
http://www.kernel.org/pub/software/scm/git/docs/git-format-patch.html
http://www.kernel.org/pub/software/scm/git/docs/git-send-email.html


# One time commands

> apt-get install git-email
> cd /usr/src
> git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git linux-2.6
> cd /usr/src/linux-2.6
> git config --global user.name "Vincent Li"
> git config --global user.email "username@example.com"
> git config --global sendemail.smtpserver smtp.example.com
> git config --global sendemail.smtpserverport 587
> git config --global sendemail.smtpuser username
> git config --global sendemail.smtppass userpass

Update: here is what I did for my gmail account (from my shell history):

199 apt-get install git-email
203 git config --global sendemail smtpserver smtp.gmail.com
205 cd .git
210 git branch
212 git config --global sendemail.smtpserver smtp.gmail.com
213 git config --global sendemail.smtpserverport 587
214 git config --global sendemail.smtpencryption tls
215 git config --global sendemail.smtpuser myusername@gmail.com
216 git config --global sendemail.smtppass xxxxx
233 git config --global user.name "Vincent Li"
234 git config --global user.email "myusername@gmail.com"



# Make a new branch for the patch you're doing. In this case, I'll replace BUG_ON with VM_BUG_ON in mm/vmscan.c

> git checkout -b experimental

# Now edit the file

> perl -pi -e 's/^(\t+)BUG_ON/$1VM_BUG_ON/g' mm/vmscan.c
> git commit -a

# Put in a simple message of a line or two.

Trivial: Replace BUG_ON with VM_BUG_ON for consistency

The VM subsystem uses VM_BUG_ON to test likely bug situations; mm/vmscan.c still has three BUG_ONs left. Replace them with VM_BUG_ON for code consistency.

# Now exit the editor

# Check the commit, which is the most recent one by default

> git log -1

commit 64ea153753811970563ecf5938a8a87c54336495
Author: Vincent Li
Date: Wed Jul 8 10:17:37 2009 -0700

Trivial: Replace BUG_ON with VM_BUG_ON for consistency

The VM subsystem uses VM_BUG_ON to test likely bug situations; mm/vmscan.c still has three BUG_ONs left. Replace them with VM_BUG_ON for code consistency.

# See the actual patch with:
> git diff master..HEAD


# If you want to format a single commit with a Signed-off-by line, you can do this with "git format-patch -1 -s <commit>".

> git format-patch -1 -s 64ea153
0001-Trivial-Replace-BUG_ON-with-VM_BUG_ON-for-consisten.patch

# Now look the patch over and see if you need to edit the subject or anything


# Now do a dry run to send the email
> git send-email --dry-run --to=username@example.com
0001-Trivial-Replace-BUG_ON-with-VM_BUG_ON-for-consisten.patch

# Looks good, send for real
> git send-email --to=linux-kernel@vger.kernel.org
0001-Trivial-Replace-BUG_ON-with-VM_BUG_ON-for-consisten.patch

# I use alpine as my email client; save the email as a single mbox file, for example /tmp/trivial.patch, then use git am to apply the patch

> git checkout master

# git-am refuses to process new mailboxes while the .git/rebase-apply directory exists, so if you decide to start over from scratch,
# run rm -f -r .git/rebase-apply before running the command with mailbox names.
> rm -f -r .git/rebase-apply
> git am /tmp/trivial.patch
Applying: Trivial: Replace BUG_ON with VM_BUG_ON for consistency

> git log -1

commit 9ba28a665d0a642f9bfda54a6ffedb8c0e8dbd8b
Author: Vincent Li
Date: Wed Jul 8 10:35:08 2009 -0700

Trivial: Replace BUG_ON with VM_BUG_ON for consistency

The VM subsystem uses VM_BUG_ON to test likely bug situations; mm/vmscan.c
still has three BUG_ONs left. Replace them with VM_BUG_ON for code consistency.

Signed-off-by: Vincent Li

# Now if you are not happy with the patch and don't want it in history, reset the master branch with

> git reset --hard HEAD^



That is my sample git workflow. Of course you can merge your experimental branch into the master branch with git merge; I just showed the way to format/send/receive/apply a patch by email, since from time to time you may need to send out a trivial patch or test someone else's patch as a system administrator rather than a full-time programmer.

More info on how to submit multiple patches is at the link below:

http://www.spinics.net/lists/newbies/msg44250.html

For example:

Create a local branch for a tree:

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/sfr/linux-next.git
$ cd linux-next
$ git checkout -b devel origin/master

Do some change and commit:

$ emacs drivers/staging/pohmelfs/dir.c
$ git add drivers/staging/pohmelfs/dir.c
$ git commit -m "Staging: pohmelfs/dir.c: Fix something"

Do another change and commit:

$ emacs drivers/staging/pohmelfs/dir.c
$ git commit -m "Staging: pohmelfs/dir.c: Fix another thing"

Generate your patchset with your last two commits

$ git format-patch -s -2

This will create one file for each patch generated.

So to send your patchset you can use the command:

$ git send-email --compose --to='Zac Storer '
--cc='kernelnewbies@xxxxxxxxxxxxxxxxx' *.patch

The command will extract the commit message and use it as the mail
subject, with the --compose flag you can create a prelude mail
explaining your patchset.

So this command will create 3 mails with these subjects

[PATCH 0/2] Staging: pohmelfs/dir.c: Fixes
[PATCH 1/2] Staging: pohmelfs/dir.c: Fix something
[PATCH 2/2] Staging: pohmelfs/dir.c: Fix another thing

Also you can be sure that your email client didn't wrap lines and that the
message is encoded in ASCII.

Remember always to use scripts/checkpatch.pl to check your patches and
scripts/get_maintainer.pl to check who are the developers that have to
be cc'ed.

Friday, July 3, 2009

Direct Page reclaim and Background Page reclaim call path

Direct Page Reclaim call path

__get_free_pages ->
alloc_pages ->
__alloc_pages_nodemask ->
__alloc_pages_slowpath ->
__alloc_pages_direct_reclaim ->
try_to_free_pages ->
do_try_to_free_pages ->
shrink_slab / shrink_zones ->
shrink_zone ->
shrink_list ->
shrink_inactive_list / shrink_active_list ->
shrink_page_list ->
pageout ->
|
V
mapping->a_ops->writepage


Background Page Reclaim call path


wakeup_kswapd ->
kswapd ->
balance_pgdat ->
shrink_slab / shrink_zone ->
shrink_list ->
shrink_inactive_list / shrink_active_list ->
shrink_page_list ->
pageout ->
|
V
mapping->a_ops->writepage



Note: pages are moved from the active list to the inactive list and are freed from there in the end.
shrink_active_list moves pages to the inactive list; while being moved, pages are first isolated from
the LRU list onto a private list (page_list or l_hold).

Thursday, July 2, 2009

VM_BUG_ON(PageLRU(page)) and VM_BUG_ON(!PageLRU(page)) in mm/vmscan.c

My confusion about VM_BUG_ON(!PageLRU(page)) vs VM_BUG_ON(PageLRU(page)):

http://zh-kernel.org/pipermail/linux-kernel/2009-June/011552.html

kernel virtual address calculation


I had an interesting chat with my friend qunying about how to translate the hex representation of an address into a size:

(14:29:56) vincentinsz: on x86 32bit the kernel image located at physical address 1MiB, which translate to 0x00100000, but how 0x00100000 equals 1M, how to caculate it?
(14:32:00) qunying: 1024*1024 = 0x100000
(14:32:11) qunying: that is 1MiB
(14:33:56) vincentinsz: is there easy way to see 0x100000 as 1024 * 1024?
(14:34:18) qunying: just count the zeros
(14:34:36) qunying: one 0 in hex is 2^^2
(14:34:49) qunying: there is 5 zero, that is 2^^10
(14:35:00) qunying: that is 1M
(14:35:54) qunying: one 0 is 2^^4
(14:35:58) qunying: not 2,
(14:36:09) qunying: 5 zero is 2^^20
(14:37:59) vincentinsz: how do you get one 0 is 2^^4?
(14:38:27) qunying: for one number in hex represents 4 bits in binary
(14:38:55) qunying: 0x10 = 2^4, 0x100 = 2^8, etc
(14:40:01) vincentinsz: 0x10 = 1000 0000
(14:40:13) qunying: ya
(14:40:23) qunying: no
(14:40:29) qunying: 001 000
(14:40:34) qunying: 0001 0000
(14:40:47) vincentinsz: i see
(14:42:20) vincentinsz: what about some other hex address like 0xC0000000 which is about 3G, How to caculate
(14:45:10) vincentinsz: so there is 7 0s which is 2^^28?
(14:45:24) qunying: ya
(14:45:35) qunying: C is 1100
(14:46:00) qunying: so times 2^12
(14:46:35) vincentinsz: ah, so 2^^30 * 3?
(14:48:33) qunying: ya


---

Return page_count(page) - !!page_has_private(page) == 2 discussion


(15:34:44) vincentinsz: 287 static inline int is_page_cache_freeable(struct page *page)
288 {
289 return page_count(page) - !!page_has_private(page) == 2;
290 }

(15:35:21) vincentinsz: this function eventually returns 0 or 1, right?
(15:36:35) qunying: not understand it fully, looks strange to me
(15:38:20) qunying: as !!page_has_private(page) should return 0 or 1, and !!page_has_private(page) == 2 should always fail, then that is the result of page_count(page)
(15:40:02) vincentinsz: I thought it is like return 3 - 1 == 2? 1 : 0 ?
(15:40:45) qunying: ah right, forgot the '-'
(15:41:11) qunying: it is always return 0 or 1
(15:42:52) vincentinsz: not sure why number 2 is special in this case ==2
(15:43:13) vincentinsz: why not == 1, or == 3 ?
(15:44:00) qunying: that is beyond my understanding, you make dig into how page_count is working
(15:53:00) vincentinsz: what the !! is for, like !!func(a), always get the oposite of function retuning value?
(15:53:47) qunying: not, it normalize the return code to 0 or 1
(15:54:02) qunying: some func(a) may return > 1 or < 0, !! normalizes any value != 0 to 1, and 0 to 0
(15:57:25) qunying: ya
(15:57:35) vincentinsz: f**k :-) so != 0 to make it 1
(16:58:12) qunying logged out.


---


(09:31:33) vincentinsz: Hi, still on the strange !!((page)->flags & (1 << PG_private)): the (1 << PG_private) mask picks that bit out of (page)->flags, giving something like 00010000 where the 1 bit represents PG_private, am I right?
(09:34:46) qunying: yes
(09:35:33) vincentinsz: then !!(0000100000) make it to vaule 1, right?
(09:35:59) qunying: yes
(09:38:03) vincentinsz: someone else had this explanation: http://zh-kernel.org/pipermail/linux-kernel/2009-June/011228.html
(09:38:32) vincentinsz: is that the same thing as you said?
(09:39:27) qunying: ya
(09:40:13) vincentinsz: is that to say that !! will always get 1?
(09:40:36) qunying: no, it says none 0 value to 1
(09:40:49) qunying: 0 will always get 0
(09:51:05) vincentinsz: ok. Oh and the page_count(page) - !!((page)->flags & (1 << PG_private)) == 2 part: if the page flag PG_private is set, then there should be another two bits set to 1 in (page)->flags so that this page can be freeable
(09:51:53) qunying: i see
(09:53:06) vincentinsz: the other two bit could mean a page is in user mapped address space and LRU (Least recently used) list which are most likely for page reclaim candidate

(10:03:46) vincentinsz: There are many details, I could be wrong :-), the devil is the detail
(10:04:19) qunying: ^_6
(10:48:29) vincentinsz: ok, more, page_count(page) count the reference count of page, if page flag PG_private bit flag is set, the page is pagecache page backed by inode or swap , so the pagecache itself would have 1 reference to the page, that is at least 2 ref count. Then the page has to be referenced in LRU list so it can be freed, that is 3.
(10:51:11) qunying: hmm, that is why it minors the 1 reference from private bit reference
(10:51:16) qunying: minus

And my question on the zh-kernel mailing list:


http://zh-kernel.org/pipermail/linux-kernel/2009-June/011426.html


Johannes Weiner has since patched this function with comments to make it clearer:


http://marc.info/?l=linux-mm&m=124830074212169&w=2
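
A tiny userspace sketch of the two ideas in this discussion, the !! normalization and the "count == 2" test; the stub values are made up purely for illustration:

#include <stdio.h>

/* Stand-ins for page_count() and page_has_private(); the values are made up
 * just to exercise the expression (3 - !!nonzero == 2). */
static int page_count_stub(void)       { return 3; }    /* pretend reference count     */
static int page_has_private_stub(void) { return 0x10; } /* pretend PG_private mask hit */

static int is_page_cache_freeable_stub(void)
{
        /* !! turns any non-zero value (here 0x10) into 1 and leaves 0 as 0,
         * so the private reference is subtracted exactly once. */
        return page_count_stub() - !!page_has_private_stub() == 2;
}

int main(void)
{
        printf("!!0x10 = %d, !!0 = %d\n", !!0x10, !!0);
        printf("3 - !!0x10 == 2 ? %d\n", is_page_cache_freeable_stub());
        return 0;
}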



I had a chat on the #mm channel with hnaz about the task_struct members children and sibling (both list heads):



* Now talking on #mm
* Topic for #mm is: Memory Management - http://linux-mm.org/
* Topic for #mm set by ChanServ!services@services.oftc.net at Fri May 22 01:44:02 2009
macli I am newbie, and reading mm/oom_kill.c, wondering why list_for_each_entry(child, &p->children, sibling) in badness(), not list_for_each_entry(child, &p->children, children)?

hnaz macli: it's the linkname. task->children is the head of a list that is linked by task->sibling

macli hnaz: I see #define list_for_each_entry(pos, head, member) in list.h, where I can find the code that task->children ,the list head which is linked by task->sibling, I see struct task_struct has two list_head children and silbling

macli struct list_head children; /* list of my children */

macli struct list_head sibling; /* linkage in my parent's children list */

macli I am assuming that children is the list head of a task's children list

macli sibling is the list head of a task's parent's children list which is different with the children list head, that is my understanding of reading the comment

hnaz macli: you have to understand that a 'list_head' is at the same time a node. it represents one link in the list

hnaz macli: children is the link to other task structs that represent the children

hnaz macli: while sibling is the link to chain up a task as part of another task's children list

macli hnaz: I see, thanks for the explaintion
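
A sketch of the pattern hnaz describes, using the kernel's list API: the parent's children list_head is the head of the list, and each child is linked into it through its own sibling list_head. The toy_task struct and helpers below just mimic task_struct for illustration:

#include <linux/kernel.h>
#include <linux/list.h>

/* Toy stand-in for task_struct, keeping only the two list_heads discussed. */
struct toy_task {
        const char *comm;
        struct list_head children;      /* head of my children list             */
        struct list_head sibling;       /* my link in my parent's children list */
};

static void toy_task_init(struct toy_task *t, const char *comm)
{
        t->comm = comm;
        INIT_LIST_HEAD(&t->children);
        INIT_LIST_HEAD(&t->sibling);
}

static void toy_add_child(struct toy_task *parent, struct toy_task *child)
{
        /* The child is chained onto parent->children through child->sibling. */
        list_add_tail(&child->sibling, &parent->children);
}

static void toy_walk_children(struct toy_task *parent)
{
        struct toy_task *child;

        /* Same shape as the loop in badness(): the head is p->children and
         * 'sibling' names the member that links the entries together. */
        list_for_each_entry(child, &parent->children, sibling)
                printk(KERN_INFO "child: %s\n", child->comm);
}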
