This patch adds
vmalloc_node(size, node) -> Allocate necessary memory on the specified node
and
get_vm_area_node(size, flags, node)
and the other functions that it depends on.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Beginning of gfp_t annotations:
- -Wbitwise added to CHECKFLAGS
- old __bitwise renamed to __bitwise__
- __bitwise defined to either __bitwise__ or nothing, depending on
__CHECK_ENDIAN__ being defined
- gfp_t switched from __nocast to __bitwise__
- force cast to gfp_t added to __GFP_... constants
- new helper - gfp_zone(); extracts zone bits out of gfp_t value and casts
the result to int
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The NUMA counters in struct per_cpu_pageset (linux/mmzone.h) are never
cleared today. This works ok for CPU 0 on NUMA machines because
boot_pageset[] is already zero, but for other CPU:s this results in
uninitialized counters.
Signed-off-by: Magnus Damm <magnus@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This reverts commit 3359b54c8c07338f3a863d1109b42eebccdcf379 and
replaces it with a cleaner version that is purely based on page table
operations, so that the synchronization between inode size and hugetlb
mappings becomes moot.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This introduces a limit parameter to the core bootmem allocator; The new
parameter indicates that physical memory allocated by the bootmem
allocator should be within the requested limit.
We also introduce alloc_bootmem_low_pages_limit, alloc_bootmem_node_limit,
alloc_bootmem_low_pages_node_limit apis, but alloc_bootmem_low_pages_limit
is the only api used for swiotlb.
The existing alloc_bootmem_low_pages() api could instead have been
changed and made to pass right limit to the core allocator. But that
would make the patch more intrusive for 2.6.14, as other arches use
alloc_bootmem_low_pages(). We may be done that post 2.6.14 as a
cleanup.
With this, swiotlb gets memory within 4G for both x86_64 and ia64
arches.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Ravikiran G Thirumalai <kiran@scalex86.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
hugetlbfs allows truncation of its files (should it?), but hugetlb.c often
forgets that: crashes and misaccounting ensue.
copy_hugetlb_page_range better grab the src page_table_lock since we don't
want to guess what happens if concurrently truncated. unmap_hugepage_range
rss accounting must not assume the full range was mapped. follow_hugetlb_page
must guard with page_table_lock and be prepared to exit early.
Restyle copy_hugetlb_page_range with a for loop like the others there.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The hugetlb pages are currently pre-faulted. At the time of mmap of
hugepages, we populate the new PTEs. It is possible that HW has already
cached some of the unused PTEs internally. These stale entries never
get a chance to be purged in existing control flow.
This patch extends the check in page fault code for hugepages. Check if
a faulted address falls with in size for the hugetlb file backing it.
We return VM_FAULT_MINOR for these cases (assuming that the arch
specific page-faulting code purges the stale entry for the archs that
need it).
Signed-off-by: Rohit Seth <rohit.seth@intel.com>
[ This is apparently arguably an ia64 port bug. But the code won't
hurt, and for now it fixes a real problem on some ia64 machines ]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As noticed by Nick Piggin, we need to make sure that we check the page
count before we check for PageDirty, since the dirty check is only valid
if the count implies that we're the only possible ones holding the page.
We always did do this, but the code needs a read-memory-barrier to make
sure that the orderign is also honored by the CPU.
(The writer side is ordered due to the atomic decrement and test on the
page count, see the discussion on linux-kernel)
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Refuse to install a page into a mapping if the mapping count is already
ridiculously large.
You probably cannot trigger this on 32-bit architectures, but on a
64-bit setup we should protect against it.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Revert this recent correctness change: Douglas Crosher <dcrosher@scieneer.com>
reported that it broke an existing application, and that madvise() works
without error on anonymous mappings on Solaris.
This means that madvise() will remain non-standards-compliant: we should
return -EBADF for all requests against non-file-backed vma's, but Linux only
does this for MADV_WILLNEED requests.
Signed-off-by: Suzuki K P <suzuki@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- added typedef unsigned int __nocast gfp_t;
- replaced __nocast uses for gfp flags with gfp_t - it gives exactly
the same warnings as far as sparse is concerned, doesn't change
generated code (from gcc point of view we replaced unsigned int with
typedef) and documents what's going on far better.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As requested by Thomas Gleixner <tglx@linutronix.de>:
"5d3d0f7704ed0bc7eaca0501eeae3e5da1ea6c87 breaks a couple of ARM
boards, which depend on the historical bootmem allocation order.
There is a cleaner solution around to remove the pgdat list
completely, but this is a topic for post 2.6.14
Andi signalled ACK already."
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In kmalloc_node we are checking if the allocation is for the same node when
interrupts are "on". This may lead to an allocation on another node than
intended.
This patch just shifts the check for the current node in __cache_alloc_node
when interrupts are disabled.
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move the ZERO_PAGE remapping complexity to the move_pte macro in
asm-generic, have it conditionally depend on
__HAVE_ARCH_MULTIPLE_ZERO_PAGE, which gets defined for MIPS.
For architectures without __HAVE_ARCH_MULTIPLE_ZERO_PAGE, move_pte becomes
a noop.
From: Hugh Dickins <hugh@veritas.com>
Fix nasty little bug we've missed in Nick's mremap move ZERO_PAGE patch.
The "pte" at that point may be a swap entry or a pte_file entry: we must
check pte_present before perhaps corrupting such an entry.
Patch below against 2.6.14-rc2-mm1, but the same bug is in 2.6.14-rc2's
mm/mremap.c, and more dangerous there since it's affecting all arches: I
think the safest course is to send Nick's patch and Yoichi's build fix and
this fix (build tested) on to Linus - so only MIPS can be affected.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As davem points out, this wasn't such a great idea. There may be some code
which does:
size = 1024*1024;
while (kmalloc(size, ...) == 0)
size /= 2;
which will now explode.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Problem: In some circumstances, bd_claim() is returning the wrong error
code.
If we try to swapon an unused block device that isn't swap formatted, we
get -EINVAL. But if that same block device is already mounted, we instead
get -EBUSY, even though it still isn't a valid swap device.
This issue came up on the busybox list trying to get the error message
from "swapon -a" right. If a swap device is already enabled, we get -EBUSY,
and we shouldn't report this as an error. But we can't distinguish the two
-EBUSY conditions, which are very different errors.
In the code, bd_claim() returns either 0 or -EBUSY, but in this case busy
means "somebody other than sys_swapon has already claimed this", and
_that_ means this block device can't be a valid swap device. So return
-EINVAL there.
Signed-off-by: Rob Landley <rob@landley.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I had an issue on ia64 where I got a bug in kernel/workqueue because
kzalloc returned a NULL pointer due to the task structure getting too big
for the slab allocator. Usually these cases are caught by the kmalloc
macro in include/linux/slab.h.
Compilation will fail if a too big value is passed to kmalloc.
However, kzalloc uses __kmalloc which has no check for that. This patch
makes __kmalloc bug if a too large entity is requested.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The numa slab allocator may allocate pages from foreign nodes onto the
lists for a particular node if a node runs out of memory. Inspecting the
slab->nodeid field will not reflect that the page is now in use for the
slabs of another node.
This patch fixes that issue by adding a node field to free_block so that
the caller can indicate which node currently uses a slab.
Also removes the check for the current node from kmalloc_cache_node since
the process may shift later to another node which may lead to an allocation
on another node than intended.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It is essential that index_of() be inlined. But alpha undoes the gcc
inlining hackery and index_of() ends up out-of-line. So fiddle with things
to make that function inline again.
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Hugh made me note this line for permission checking in mprotect():
if ((newflags & ~(newflags >> 4)) & 0xf) {
after figuring out what's that about, I decided it's nasty enough. Btw
Hugh itself didn't like the 0xf.
We can safely change it to VM_READ|VM_WRITE|VM_EXEC because we never change
VM_SHARED, so no need to check that.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
That comment is plain wrong (we even take the pagetable lock inside
unmap_region()).
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With the new changes that we made in the initialization of the slab
allocator, we first setup the cache from which array caches are allocated,
and then the cache, from which kmem_list3's are allocated.
Now if the array cache comes from a cache in which objsize > 32, (in this
instance size-64) then, first size-64 cache will be allocated and then the
size-128 (if this is the cache from which kmem_list3's are going to be
allocated).
So with these new changes, we are not guaranteed that we will be
initializing the malloc_sizes array in a serialized order. Thus there is
a bug in __find_general_cachep, as we are checking whether the first
cache_sizes ptr is NULL.
This is replaced by checking whether the array-cache cache is initialized.
Attached is a patch which does that. Boots fine on a x86-64, with
DEBUG_SPIN, DEBUG_SLAB, and preempt.
Attached is a patch which does that. Boots fine on a x86-64, with
DEBUG_SPIN, DEBUG_SLAB, and preempt.Thanks & Regards, Alok
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Shobhit Dayal <shobhitdayal.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Pavel Emelianov and Kirill Korotaev observe that fs and arch users of
security_vm_enough_memory tend to forget to vm_unacct_memory when a
failure occurs further down (typically in setup_arg_pages variants).
These are all users of insert_vm_struct, and that reservation will only
be unaccounted on exit if the vma is marked VM_ACCOUNT: which in some
cases it is (hidden inside VM_STACK_FLAGS) and in some cases it isn't.
So x86_64 32-bit and ppc64 vDSO ELFs have been leaking memory into
Committed_AS each time they're run. But don't add VM_ACCOUNT to them,
it's inappropriate to reserve against the very unlikely case that gdb
be used to COW a vDSO page - we ought to do something about that in
do_wp_page, but there are yet other inconsistencies to be resolved.
The safe and economical way to fix this is to let insert_vm_struct do
the security_vm_enough_memory check when it finds VM_ACCOUNT is set.
And the MIPS irix_brk has been calling security_vm_enough_memory before
calling do_brk which repeats it, doubly accounting and so also leaking.
Remove that, and all the fs and arch calls to security_vm_enough_memory:
give it a less misleading name later on.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-Off-By: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use the add_taint() interface for setting tainted bit flags instead of
doing it manually.
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There was a pretty bad bug in there that the code would always check the full
VMA, not the range the user requested.
When the VMA to be checked was merged with the previous VMA this could lead to
spurious failures.
Signed-off-by: "Andi Kleen" <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use the pgdat pointer we've already defined in wakeup_kswapd
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This leads to bootmem allocating first from node 0 instead
of from the last node. This avoids swiotlb allocating on the last node, which
doesn't really work on a machine with >4GB.
Note: there is a better patch around from someone else that gets
rid of the pgdat list completely.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move call to get_mm_counter() in update_mem_hiwater() to be
inside the check for tsk->mm being null. Otherwise you can be
following a null pointer here. This patch submitted by
Javier Herrero <jherrero@hvsistemas.es>.
Modify the end check for munmap regions to allow for the
legacy behavior of 0 being valid. Pretty much all current
uClinux system libc malloc's pass in 0 as the end point.
A hard check will fail on these, so change the check so
that if it is non-zero it must be valid otherwise it fails.
A passed in value will always succeed (as it used too).
Also export a few more mm system functions - to be consistent
with the VM code exports.
Signed-off-by: Greg Ungerer <gerg@uclinux.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix the sparse warning "implicit cast to nocast type"
Signed-off-by: Victor Fusco <victor@cetuc.puc-rio.br>
Signed-off-by: Domen Puncer <domen@coderock.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix the sparse warning "implicit cast to nocast type"
Signed-off-by: Victor Fusco <victor@cetuc.puc-rio.br>
Signed-off-by: Domen Puncer <domen@coderock.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With Nick Piggin <npiggin@suse.de>
Give some things static scope.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Clean up timer initialization by introducing DEFINE_TIMER a'la
DEFINE_SPINLOCK. Build and boot-tested on x86. A similar patch has been
been in the -RT tree for some time.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch clarifies NULL handling of kfree() and vfree(). I addition,
wording of calling context restriction for vfree() and vunmap() are changed
from "may not" to "must not."
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The NUMA API change that introduced kmalloc_node was accepted for
2.6.12-rc3. Now it is possible to do slab allocations on a node to
localize memory structures. This API was used by the pageset localization
patch and the block layer localization patch now in mm. The existing
kmalloc_node is slow since it simply searches through all pages of the slab
to find a page that is on the node requested. The two patches do a one
time allocation of slab structures at initialization and therefore the
speed of kmalloc node does not matter.
This patch allows kmalloc_node to be as fast as kmalloc by introducing node
specific page lists for partial, free and full slabs. Slab allocation
improves in a NUMA system so that we are seeing a performance gain in AIM7
of about 5% with this patch alone.
More NUMA localizations are possible if kmalloc_node operates in an fast
way like kmalloc.
Test run on a 32p systems with 32G Ram.
w/o patch
Tasks jobs/min jti jobs/min/task real cpu
1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005
100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005
200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005
300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005
400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005
500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005
600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005
700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005
800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005
900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005
1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005
w/patch
Tasks jobs/min jti jobs/min/task real cpu
1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005
100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005
200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005
300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005
400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005
500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005
600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005
700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005
800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005
900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005
1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005
These are measurement taken directly after boot and show a greater
improvement than 5%. However, the performance improvements become less
over time if the AIM7 runs are repeated and settle down at around 5%.
Links to earlier discussions:
http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2
Changelog V4-V5:
- alloc_arraycache and alloc_aliencache take node parameter instead of cpu
- fix initialization so that nodes without cpus are properly handled.
- simplify code in kmem_cache_init
- patch against Andrews temp mm3 release
- Add Shai to credits
- fallback to __cache_alloc from __cache_alloc_node if the node's cache
is not available yet.
Changelog V3-V4:
- Patch against 2.6.12-rc5-mm1
- Cleanup patch integrated
- More and better use of for_each_node and for_each_cpu
- GCC 2.95 fix (do not use [] use [0])
- Correct determination of INDEX_AC
- Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes.
- Remove list3_data and list3_data_ptr macros for better readability
Changelog V2-V3:
- Made to patch against 2.6.12-rc4-mm1
- Revised bootstrap mechanism so that larger size kmem_list3 structs can be
supported. Do a generic solution so that the right slab can be found
for the internal structs.
- use for_each_online_node
Changelog V1-V2:
- Batching for freeing of wrong-node objects (alien caches)
- Locking changes and NUMA #ifdefs as requested by Manfred
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com>
Signed-off-by: Shai Fultheim <Shai@Scalex86.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch modifies tmpfs to call the inode_init_security LSM hook to set
up the incore inode security state for new inodes before the inode becomes
accessible via the dcache.
As there is no underlying storage of security xattrs in this case, it is
not necessary for the hook to return the (name, value, len) triple to the
tmpfs code, so this patch also modifies the SELinux hook function to
correctly handle the case where the (name, value, len) pointers are NULL.
The hook call is needed in tmpfs in order to support proper security
labeling of tmpfs inodes (e.g. for udev with tmpfs /dev in Fedora). With
this change in place, we should then be able to remove the
security_inode_post_create/mkdir/... hooks safely.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Update the file systems in fs/ implementing a delete_inode() callback to
call truncate_inode_pages(). One implementation note: In developing this
patch I put the calls to truncate_inode_pages() at the very top of those
filesystems delete_inode() callbacks in order to retain the previous
behavior. I'm guessing that some of those could probably be optimized.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Run PCI driver initialization on local node
Instead of adding messy kmalloc_node()s everywhere run the
PCI driver probe on the node local to the device.
This would not have helped for IDE, but should for
other more clean drivers that do more initialization in probe().
It won't help for drivers that do most of the work
on first open (like many network drivers)
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch introduces a kzalloc wrapper and converts kernel/ to use it. It
saves a little program text.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Now the real motivation for this cpuset mem_exclusive patch series seems
trivial.
This patch keeps a task in or under one mem_exclusive cpuset from provoking an
oom kill of a task under a non-overlapping mem_exclusive cpuset. Since only
interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive
containment, there is little to gain from oom killing a task under a
non-overlapping mem_exclusive cpuset, as almost all kernel and user memory
allocation must come from disjoint memory nodes.
This patch enables configuring a system so that a runaway job under one
mem_exclusive cpuset cannot cause the killing of a job in another such cpuset
that might be using very high compute and memory resources for a prolonged
time.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch makes use of the previously underutilized cpuset flag
'mem_exclusive' to provide what amounts to another layer of memory placement
resolution. With this patch, there are now the following four layers of
memory placement available:
1) The whole system (interrupt and GFP_ATOMIC allocations can use this),
2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use),
3) The current tasks cpuset (GFP_USER allocations constrained to here), and
4) Specific node placement, using mbind and set_mempolicy.
These nest - each layer is a subset (same or within) of the previous.
Layer (2) above is new, with this patch. The call used to check whether a
zone (its node, actually) is in a cpuset (in its mems_allowed, actually) is
extended to take a gfp_mask argument, and its logic is extended, in the case
that __GFP_HARDWALL is not set in the flag bits, to look up the cpuset
hierarchy for the nearest enclosing mem_exclusive cpuset, to determine if
placement is allowed. The definition of GFP_USER, which used to be identical
to GFP_KERNEL, is changed to also set the __GFP_HARDWALL bit, in the previous
cpuset_gfp_hardwall_flag patch.
GFP_ATOMIC and GFP_KERNEL allocations will stay within the current tasks
cpuset, so long as any node therein is not too tight on memory, but will
escape to the larger layer, if need be.
The intended use is to allow something like a batch manager to handle several
jobs, each job in its own cpuset, but using common kernel memory for caches
and such. Swapper and oom_kill activity is also constrained to Layer (2). A
task in or below one mem_exclusive cpuset should not cause swapping on nodes
in another non-overlapping mem_exclusive cpuset, nor provoke oom_killing of a
task in another such cpuset. Heavy use of kernel memory for i/o caching and
such by one job should not impact the memory available to jobs in other
non-overlapping mem_exclusive cpusets.
This patch enables providing hardwall, inescapable cpusets for memory
allocations of each job, while sharing kernel memory allocations between
several jobs, in an enclosing mem_exclusive cpuset.
Like Dinakar's patch earlier to enable administering sched domains using the
cpu_exclusive flag, this patch also provides a useful meaning to a cpuset flag
that had previously done nothing much useful other than restrict what cpuset
configurations were allowed.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch series extends the use of the cpuset attribute 'mem_exclusive'
to support cpuset configurations that:
1) allow GFP_KERNEL allocations to come from a potentially larger
set of memory nodes than GFP_USER allocations, and
2) can constrain the oom killer to tasks running in cpusets in
a specified subtree of the cpuset hierarchy.
Here's an example usage scenario. For a few hours or more, a large NUMA
system at a University is to be divided in two halves, with a bunch of student
jobs running in half the system under some form of batch manager, and with a
big research project running in the other half. Each of the student jobs is
placed in a small cpuset, but should share the classic Unix time share
facilities, such as buffered pages of files in /bin and /usr/lib. The big
research project wants no interference whatsoever from the student jobs, and
has highly tuned, unusual memory and i/o patterns that intend to make full use
of all the main memory on the nodes available to it.
In this example, we have two big sibling cpusets, one of which is further
divided into a more dynamic set of child cpusets.
We want kernel memory allocations constrained by the two big cpusets, and user
allocations constrained by the smaller child cpusets where present. And we
require that the oom killer not operate across the two halves of this system,
or else the first time a student job runs amuck, the big research project will
likely be first inline to get shot.
Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project
really does run amuck allocating memory, it should be shot, not some other
task outside the research projects mem_exclusive cpuset.
I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage
such scenarios. Let memory allocations for user space (GFP_USER) be
constrained by a tasks current cpuset, but memory allocations for kernel space
(GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the
current cpuset, even though kernel space allocations will still _prefer_ to
remain within the current tasks cpuset, if memory is easily available.
Let the oom killer be constrained to consider only tasks that are in
overlapping mem_exclusive cpusets (it won't help much to kill a task that
normally cannot allocate memory on any of the same nodes as the ones on which
the current task can allocate.)
The current constraints imposed on setting mem_exclusive are unchanged. A
cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a
mem_exclusive cpuset may not overlap any of its siblings memory nodes.
This patch was presented on linux-mm in early July 2005, though did not
generate much feedback at that time. It has been built for a variety of
arch's using cross tools, and built, booted and tested for function on SN2
(ia64).
There are 4 patches in this set:
1) Some minor cleanup, and some improvements to the code layout
of one routine to make subsequent patches cleaner.
2) Add another GFP flag - __GFP_HARDWALL. It marks memory
requests for USER space, which are tightly confined by the
current tasks cpuset.
3) Now memory requests (such as KERNEL) that not marked HARDWALL can
if short on memory, look in the potentially larger pool of memory
defined by the nearest mem_exclusive ancestor cpuset of the current
tasks cpuset.
4) Finally, modify the oom killer to skip any task whose mem_exclusive
cpuset doesn't overlap ours.
Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32
bytes of kernel text space. Patch (2) has no affect on the size of kernel
text space (it just adds a preprocessor flag). Patches (3) and (4) added
about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which
matters only if CONFIG_CPUSET is enabled.
This patch:
This patch applies a few comment and code cleanups to mm/oom_kill.c prior to
applying a few small patches to improve cpuset management of memory placement.
The comment changed in oom_kill.c was seriously misleading. The code layout
change in select_bad_process() makes room for adding another condition on
which a process can be spared the oom killer (see the subsequent
cpuset_nodes_overlap patch for this addition).
Also a couple typos and spellos that bugged me, while I was here.
This patch should have no material affect.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Mark variables which are usually accessed for reads with __readmostly.
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We don't reset the cache hit count until after readahead does a successful
readahead. This seems to leave a corner case open where we miss in cache,
but don't restart the readhead right away.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move some more frequently read variables that showed up during some of our
performance tests as sometimes ending up in hot cachelines to the
read_mostly section.
Fix: Move the __read_mostly from before hpet_usec_quotient to follow the
variable like the other uses of __read_mostly.
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Christoph Lameter <christoph@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>