Currently, when calculating the boosted util for a cpu, a fixed value
of 1024 is used. So when top-app tasks are moved to LC, which has much
lower capacity than BC, the calculated freq will be high even when the
cpu util is low. This results in higher power consumption, especially
on architectures which have more little cores than big cores.
Replacing the fixed value of 1024 with the actual cpu capacity reduces
the freq calculated on LC.
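A minimal sketch of the idea for a positive boost (function shape and
names assumed, not the literal android code): the headroom above the
current signal is scaled by the cpu's own capacity rather than the
constant SCHED_CAPACITY_SCALE (1024).

	static unsigned long schedtune_margin(int cpu, unsigned long signal,
					      unsigned long boost)
	{
		/* Real capacity of this cpu, not the fixed 1024 */
		unsigned long cap = capacity_orig_of(cpu);
		unsigned long margin = 0;

		if (cap > signal)
			margin = cap - signal;

		/* boost is a percentage of the remaining headroom */
		return margin * boost / 100;
	}

On a little core with cap ~ 400, a util of 300 now yields a margin of
boost% of 100 instead of boost% of 724, so the requested freq stays
proportionate to what the core can actually deliver.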
Bug: 152925197
Test: boosted util reduced on little cores
Signed-off-by: Rick Yiu <rickyiu@google.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Change-Id: I80cdd08a2c7fa5e674c43bfc132584d85c14622b
PELT doesn't account for real-time task utilization in cpu_util(). As
a result, a CPU busy running an RT task is considered low-utilization
by the scheduler. Fix this by taking real-time loading into account.
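One plausible shape of the fix, mirroring the upstream rt_rq PELT
signal (rq->avg_rt exists upstream since 4.19 and in android
backports; treat the exact field names here as an assumption):

	static inline unsigned long cpu_util_rt(struct rq *rq)
	{
		return READ_ONCE(rq->avg_rt.util_avg);
	}

	static inline unsigned long cpu_util(int cpu)
	{
		struct rq *rq = cpu_rq(cpu);
		unsigned long util = READ_ONCE(rq->cfs.avg.util_avg);

		/* Add RT pressure on top of the CFS-only signal */
		util += cpu_util_rt(rq);

		return min_t(unsigned long, util, capacity_orig_of(cpu));
	}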
Bug: 147385228
Test: boot to home and run audio test
Change-Id: Ie4412b186608b9a618f0d35cee9a7310db481f7c
Signed-off-by: Kyle Lin <kylelin@google.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Commit 20017f3383 ("sched/fair: Only kick nohz balance when runqueue
has more than 1 task") disabled the nohz kick for LB when a rq has a
misfit task. The assumption is that this would be addressed in the
forced up-migration path. However, this path is WALT-specific, so
disabling the nohz kick breaks PELT.
Fix it by re-enabling the nohz_kick when there is a misfit task on the
rq.
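The fix amounts to one extra condition in the nohz kick decision (a
sketch; the surrounding function is nohz_kick_needed() on this kernel
version):

	if (rq->nr_running >= 2 || rq->misfit_task_load) {
		/* A misfit task also needs the idle load balancer */
		return true;
	}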
Bug: 143472450
Test: 10/10 iterations of eas_small_to_big ended up up-migrating
Fixes: 20017f3383 ("sched/fair: Only kick nohz balance when runqueue
has more than 1 task")
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I9f708eb7661a9e82afdd4e99b878995c33703a45
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
update_cpu_capacity() updates cpu_capacity_orig capped with
thermal_cap; in the non-WALT case, thermal_cap is the previous
cpu_capacity_orig. This causes cpu_capacity_orig to be capped
incorrectly.
Test: Build
Bug: 144143594
Change-Id: I1ff9d9c87554c2d2395d46b215276b7ab50585c0
Signed-off-by: Wei Wang <wvw@google.com>
(cherry picked from commit dac65a5a494f8d0c80101acc5d482d94cda6f158)
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
None of these functions does what its name implies when
CONFIG_SCHED_WALT=n. While all are currently unused, future patches
could introduce subtle bugs by calling any of them from non-WALT
specific code. Delete the functions so it's obvious if new callers are
added.
Test: build kernel
Change-Id: Ib7552afb5668b48fe2ae56307016e98716e00e63
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
With CONFIG_SCHED_WALT disabled, is_min_capacity_cpu() is defined to
always return true, which breaks the intended behavior of
task_fits_max(). Revise is_min_capacity_cpu() to return correct
results.
An earlier version of this patch failed to handle the case when
min_cap_orig_cpu == -1 while sched domains are being updated due to
hotplug. Add a check for this case.
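A sketch of the revised helper, including the hotplug guard (where
min_cap_orig_cpu lives, and what to report during the hotplug window,
are assumptions):

	static inline bool is_min_capacity_cpu(int cpu)
	{
		int min_cap_cpu = cpu_rq(cpu)->rd->min_cap_orig_cpu;

		/* Sched domains mid-rebuild: no valid answer yet */
		if (unlikely(min_cap_cpu < 0))
			return false;

		return capacity_orig_of(cpu) ==
		       capacity_orig_of(min_cap_cpu);
	}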
Test: trace shows increased top-app placement on medium cores
Bug: 117499098
Bug: 128477368
Bug: 130756111
Change-Id: Ia2b41aa7c57f071c997bcd0e9cdfd0808f6a2bf9
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
The estimated utilization for a task:
util_est = max(util_avg, est.enqueued, est.ewma)
is defined based on:
- util_avg: the PELT-defined utilization
- est.enqueued: the util_avg at the end of the last activation
- est.ewma: an exponential moving average of the est.enqueued
  samples
According to this definition, when a task suddenly changes its
bandwidth requirements from small to big, the EWMA will need to
collect multiple samples before converging up to track the new big
utilization.
This slow convergence towards bigger utilization values is not
aligned with the default scheduler behavior, which is to optimize for
performance. Moreover, the est.ewma component fails to compensate for
temporary utilization drops which span just a few est.enqueued samples.
To let util_est do a better job in the scenario depicted above, change
its definition by making util_est directly follow upward motion and
only decay est.ewma on the downward path.
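The gist of the change in the util_est dequeue path, as posted (gated
behind the UTIL_EST_FASTUP sched feature):

	/*
	 * Reset EWMA on utilization increases; the moving average is
	 * then used only to smooth utilization decreases.
	 */
	ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED);
	if (sched_feat(UTIL_EST_FASTUP)) {
		if (ue.ewma < ue.enqueued) {
			ue.ewma = ue.enqueued;
			goto done;
		}
	}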
Signed-off-by: Patrick Bellasi <patrick.bellasi@matbug.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
(am from https://lkml.org/lkml/2019/10/23/1071)
Change-Id: Ifbde836af2e903815904b1dbf44c782b7b66f9ce
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
mark_reserved() is WALT-specific but is called from load_balance(),
which leads to build breakage when WALT is disabled. Execute the
function only if CONFIG_SCHED_WALT is enabled.
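The guard is a plain ifdef at the load_balance() call site (sketch):

	#ifdef CONFIG_SCHED_WALT
		mark_reserved(this_cpu);
	#endif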
Bug: 144142283
Test: Build and boot to home
Change-Id: I5cc3e3ece6a28c6cdabbe6964f6a6032ff2ea809
Signed-off-by: Kyle Lin <kylelin@google.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Change-Id: If25bddeb70670d0fcaf93088ebf55ab3dc80b4e3
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Change-Id: I65d0d1ae7b633969a88e20a39750fff6279db460
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
The RT_RUNTIME_SHARE sched feature enables the sharing of rt_runtime
between CPUs, allowing a CPU to run a real-time task up to 100% of the
time while leaving more space for non-real-time tasks to run on the
CPUs that lent rt_runtime.
The problem is that a CPU can easily borrow enough rt_runtime to allow
a spinning rt-task to run forever, starving per-cpu tasks like kworkers,
which are non-real-time by design.
This patch disables RT_RUNTIME_SHARE by default, avoiding this problem.
The feature will still be present for users that want to enable it,
though.
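The change itself is a one-line default flip in
kernel/sched/features.h:

	-SCHED_FEAT(RT_RUNTIME_SHARE, true)
	+SCHED_FEAT(RT_RUNTIME_SHARE, false)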
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Wei Wang <wvw@google.com>
Link: https://lkml.kernel.org/r/b776ab46817e3db5d8ef79175fa0d71073c051c7.1600697903.git.bristot@redhat.com
(cherry picked from commit 2586af1ac187f6b3a50930a4e33497074e81762d)
Change-Id: Ibb1b185d512130783ac9f0a29f0e20e9828c86fd
Bug: 169673278
Test: build, boot and check the trace with RT task
Signed-off-by: Kyle Lin <kylelin@google.com>
Change-Id: Iffede8107863b02ad4a0cb902fc8119416931bdb
This reverts commit 6f58caae21.
It's not present in newer CAF kernels and Google removed it on their
4.14 devices as well.
Change-Id: I3675cbfe4a37ae9ed31bf3659a545965a0d59c6f
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
The previous definitions, as well as the creation of this, are also
locked behind CONFIG_ZRAM_LRU_WRITEBACK.
Change-Id: I869b5595f69cc481e93ca6862b460594762d9b25
# Conflicts:
# drivers/block/zram/zram_drv.c
drivers/cpuidle/lpm-levels.o: In function `lpm_suspend_prepare':
/home/risen/android/ascendia/out/../drivers/cpuidle/lpm-levels.c:1750: undefined reference to `debug_masterstats_show'
/home/risen/android/ascendia/out/../drivers/cpuidle/lpm-levels.c:1751: undefined reference to `debug_rpmstats_show'
drivers/cpuidle/lpm-levels.o: In function `lpm_suspend_wake':
/home/risen/android/ascendia/out/../drivers/cpuidle/lpm-levels.c:1773: undefined reference to `debug_rpmstats_show'
/home/risen/android/ascendia/out/../drivers/cpuidle/lpm-levels.c:1774: undefined reference to `debug_masterstats_show'
make[1]: *** [/home/risen/android/ascendia/Makefile:1190: vmlinux] Error 1
- Return proper values when write wrappers aren't bypassed
- Revise Kconfig description
- Improve overall code style
- Don't write colocate and sched_boost_no_override values when WALT is
disabled
- Mark static data as static
- Improve readability of log messages
- Propagate cftype struct in write wrappers
- Use task_is_booster helper rather than hard-coded "init" check
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
[0ctobot: Squash kdrag0n/proton_zf6@12d005c with
kdrag0n/proton_zf6@eb73f2f]
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
This implements a mechanism by which default SchedTune parameters
can be configured in-kernel, circumventing userspace, and
mitigating reliance on ramdisk modification in the context of
custom kernels.
[2.5V]: This version adds proper protection
from userspace (mainly init) trying to write lame
boost values and gives full control to developer
and user (sh is not blocked).
[V3.0]: Use a struct to store all the values.
[0ctobot: Update for msm-4.9 and improve coding style]
[YaroST12: Update for msm-4.14]
Co-authored-by: Adam W. Willis <return.of.octobot@gmail.com>
Co-authored-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Change-Id: I70b676014d580b7df0f2962a989579376e261d49
Doing a tight loop on every pkt addition is wasteful. When the hotspot
is turned on, I noticed that the htt_htc_misc_pkt_list_trim() function
consumes at least 5% of CPU time. Cache the head of the pkt queue and
free multiple pkts at once to reduce CPU consumption.
Signed-off-by: Julian Liu <wlootlxt123@gmail.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
POPP constantly attempts to lower the GPU's frequency behind the
governor's back in order to save power; however, the GPU governor in use
(msm-adreno-tz) is very good at determining the GPU's load and selecting
an appropriate frequency to run the GPU at.
POPP was created long ago, perhaps when msm-adreno-tz didn't exist or
didn't work so well, so it is clearly deprecated. Remove it.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Currently, the kgsl worker thread is erroneously ranked right below
Android's audio threads in terms of priority.
The kgsl worker thread is in the critical path for rendering frames to
the display, so increase its priority to match the priority of the
display commit threads.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
In order to prevent redundant entry creation by racing against itself,
mb_cache_entry_create scans through a large hash-list of all current
entries in order to see if another allocation for the requested new
entry has been made. Furthermore, it allocates memory for a new entry
before scanning through this hash-list, which results in that allocated
memory being discarded when the requested new entry is already present.
This happens more than half the time.
Speed up cache entry creation by keeping a small linked list of
requested new entries in progress, and scanning through that first
instead of the large hash-list. Additionally, don't bother allocating
memory for a new entry until it's known that the allocated memory will
be used.
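A sketch of the reworked fast path (not the literal patch; the
pending-list plumbing and names are hypothetical, mb_entry_cache is
the existing kmem cache in fs/mbcache.c):

	static LIST_HEAD(pending);
	static DEFINE_SPINLOCK(pending_lock);

	/* The in-flight list is much shorter than the hash list, and
	 * nothing has been allocated yet when we walk it. */
	spin_lock(&pending_lock);
	list_for_each_entry(tmp, &pending, list) {
		if (tmp->key == key && tmp->value == value) {
			/* Duplicate creation already in progress */
			spin_unlock(&pending_lock);
			return -EBUSY;
		}
	}
	/* Publish our request, then allocate memory only now that we
	 * know it will actually be used. */
	list_add(&req.list, &pending);
	spin_unlock(&pending_lock);

	entry = kmem_cache_alloc(mb_entry_cache, mask);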
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
There is noticeable scheduling latency and heavy zone lock contention
stemming from rmqueue_bulk's single hold of the zone lock while doing
its work, as seen with the preemptoff tracer. There's no actual need for
rmqueue_bulk() to hold the zone lock the entire time; it only does so
for supposed efficiency. As such, we can relax the zone lock and even
reschedule when IRQs are enabled in order to keep the scheduling delays
and zone lock contention at bay. Forward progress is still guaranteed,
as the zone lock can only be relaxed after page removal.
With this change, rmqueue_bulk() no longer appears as a serious offender
in the preemptoff tracer, and system latency is noticeably improved.
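A sketch of the relaxed loop (simplified; the real function also
maintains the per-migratetype lists and counters):

	for (i = 0; i < count; ++i) {
		struct page *page;

		spin_lock(&zone->lock);
		page = __rmqueue(zone, order, migratetype);
		spin_unlock(&zone->lock);
		if (!page)
			break;

		list_add_tail(&page->lru, list);

		/* The page is already off the freelist, so it is safe
		 * to let other CPUs at the zone and even resched. */
		if (!irqs_disabled())
			cond_resched();
	}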
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Boost the DDR bus for 58 ms when requested in order to reduce jitter.
The 3879 frequency step was determined empirically to be the minimum
needed to sustain acceptably low jitter in UIBench.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
This driver boosts enumerated devfreq devices upon input, and allows for
boosting specific devfreq devices on other custom events. The boost
frequencies for this driver should be set so that frame drops are
near-zero at the boosted frequencies and power consumption is minimized
at said frequencies. The goal of this driver is to provide an interface
to achieve optimal device performance by requesting boosts on key
events, such as when a frame is ready to be rendered to the display.
Currently, support is only present for boosting the cpu-llcc-ddr-bw
devfreq device, but the driver is structured in a way that makes it
easy to add support for new boostable devfreq devices in the future.
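Consumers request boosts through a small kick API (the names below
follow this driver's interface; durations are in milliseconds):

	/* Brief input-style boost of the DDR bus */
	devfreq_boost_kick(DEVFREQ_CPU_LLCC_DDR_BW);

	/* Hold the device at its max boost freq for 100 ms */
	devfreq_boost_kick_max(DEVFREQ_CPU_LLCC_DDR_BW, 100);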
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
An IRQ affinity notifier getting overwritten can point to some annoying
issues which need to be resolved, like multiple pm_qos objects being
registered to the same IRQ. Print out a warning when this happens to aid
debugging.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
On ARM, IRQs are executed on the first CPU inside the affinity mask, so
setting an affinity mask with more than one CPU set is deceptive and
causes issues with pm_qos. To fix this, only set the CPU0 bit inside the
affinity mask, since that's where IRQs will run by default.
This is a follow-up to "kernel: Don't allow IRQ affinity masks to have
more than one CPU".
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Even with an affinity mask that has multiple CPUs set, IRQs always run
on the first CPU in their affinity mask. Drivers that register an IRQ
affinity notifier (such as pm_qos) will therefore have an incorrect
assumption of where an IRQ is affined.
Fix the IRQ affinity mask deception by forcing it to only contain one
set CPU.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Giving userspace intimate control over CPU latency requirements is
nonsense. Userspace can't even stop itself from being preempted, so
there's no reason for it to have access to a mechanism primarily used to
eliminate CPU delays on the order of microseconds.
Remove userspace's ability to send pm_qos requests so that it can't hurt
power consumption.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
It isn't guaranteed a CPU will idle upon calling lpm_cpuidle_enter(),
since it could abort early at the need_resched() check. In this case,
it's possible for an IPI to be sent to this "idle" CPU needlessly, thus
wasting power. For the same reason, it's also wasteful to keep a CPU
marked idle even after it's woken up.
Reduce the window that CPUs are marked idle to as small as it can be in
order to improve power consumption.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The pm_qos callback currently suffers from a number of pitfalls: it
sends IPIs to CPUs that may not be idle, waits for those IPIs to finish
propagating while preemption is disabled (resulting in a long busy wait
for the pm_qos_update_target() caller), and needlessly calls a no-op
function when the IPIs are processed.
Optimize the pm_qos notifier by only sending IPIs to CPUs that are
idle, and by using arch_send_wakeup_ipi_mask() instead of
smp_call_function_many(). Using IPI_WAKEUP instead of IPI_CALL_FUNC,
which is what smp_call_function_many() uses behind the scenes, has the
benefit of doing zero work upon receipt of the IPI; IPI_WAKEUP is
designed purely for sending an IPI without a payload, whereas
IPI_CALL_FUNC does unwanted extra work just to run the empty
smp_callback() function.
Determining which CPUs are idle is done efficiently with an atomic
bitmask instead of using the wake_up_if_idle() API, which checks the
CPU's runqueue in an RCU read-side critical section and under a spin
lock; that is not very efficient compared to a simple, atomic bitwise
operation. A cpumask isn't needed for this because NR_CPUS is
guaranteed to fit within a word.
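A sketch of the bitmask bookkeeping (symbol names hypothetical; the
single-word trick relies on NR_CPUS <= BITS_PER_LONG as noted above):

	static atomic_long_t cpus_in_idle;

	/* on idle entry */
	atomic_long_or(BIT(cpu), &cpus_in_idle);

	/* on wakeup, as early as possible */
	atomic_long_andnot(BIT(cpu), &cpus_in_idle);

	/* in the pm_qos notifier: wake only CPUs that are really idle */
	unsigned long idle = atomic_long_read(&cpus_in_idle);

	arch_send_wakeup_ipi_mask(to_cpumask(&idle));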
Change-Id: Ic4dd7e4781172bb8e3b6eb13417a814256d44cf0
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
An empty IPI is useful for cpuidle to wake sleeping CPUs without causing
them to do unnecessary work upon receipt of the IPI. IPI_WAKEUP fills
this use-case nicely, so let it be used outside of the ACPI parking
protocol.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
This allows pm_qos votes of, say, 100 us to select power levels with
exit latencies equal to 100 us. The extra microsecond of exit latency
doesn't hurt.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Currently most of the assembly files that use architecture extensions
enable them using the .arch directive but crc32.S uses .cpu instead. Move
that over to .arch for consistency.
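The swap is a single directive change in arch/arm64/lib/crc32.S (the
exact architecture string below is an assumption about the merged
form):

	-	.cpu		generic+crc
	+	.arch		armv8-a+crc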
Signed-off-by: Mark Brown <broonie@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20200414182843.31664-1-broonie@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
The upcoming GCC 9 release extends the -Wmissing-attributes warnings
(enabled by -Wall) to C and aliases: it warns when particular function
attributes are missing in the aliases but not in their target.
In particular, it triggers here because crc32_le_base/__crc32c_le_base
aren't __pure while their target crc32_le/__crc32c_le are.
These aliases are used by architectures as a fallback in accelerated
versions of CRC32. See commit 9784d82db3eb ("lib/crc32: make core crc32()
routines weak so they can be overridden").
Therefore, being fallbacks, it is likely that even if the aliases
were called from C, there wouldn't be any optimizations possible.
Currently, the only user is arm64, which calls this from asm.
Still, marking the aliases as __pure makes sense and is a good idea
for documentation purposes and possible future optimizations,
which also silences the warning.
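With the attribute applied, the alias declarations in lib/crc32.c
mirror their targets:

	u32 __pure crc32_le_base(u32, unsigned char const *, size_t)
		__alias(crc32_le);
	u32 __pure __crc32c_le_base(u32, unsigned char const *, size_t)
		__alias(__crc32c_le);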
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Tested-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Allow architectures to drop in accelerated CRC32 routines by making
the crc32_le/__crc32c_le entry points weak, and exposing non-weak
aliases for them that may be used by the accelerated versions as
fallbacks in case the instructions they rely upon are not available.
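The pattern in lib/crc32.c: the generic implementation becomes weak,
and a non-weak alias keeps it reachable as a fallback (abridged):

	u32 __pure __weak crc32_le(u32 crc, unsigned char const *p,
				   size_t len)
	{
		return crc32_le_generic(crc, p, len, crc32table_le,
					CRC32_POLY_LE);
	}

	/* Non-weak entry point for accelerated code to fall back to */
	u32 __pure crc32_le_base(u32, unsigned char const *, size_t)
		__alias(crc32_le);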
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Improve the performance of the crc32() asm routines by getting rid of
most of the branches and small sized loads on the common path.
Instead, use a branchless code path involving overlapping 16 byte
loads to process the first (length % 32) bytes, and process the
remainder using a loop that processes 32 bytes at a time.
Tested using the following test program:
  #include <stdlib.h>

  /* kernel CRC32 routine under test (real signature returns the CRC) */
  extern unsigned int crc32_le(unsigned int crc, unsigned char const *p,
                               unsigned long len);

  int main(void)
  {
          static const unsigned char buf[4096];

          srand(20181126);

          for (int i = 0; i < 100 * 1000 * 1000; i++)
                  crc32_le(0, buf, rand() % 1024);

          return 0;
  }
On Cortex-A53 and Cortex-A57, the performance regresses but only very
slightly. On Cortex-A72 however, the performance improves from
$ time ./crc32
real 0m10.149s
user 0m10.149s
sys 0m0.000s
to
$ time ./crc32
real 0m7.915s
user 0m7.915s
sys 0m0.000s
Cc: Rui Sun <sunrui26@huawei.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Unlike crc32c(), which is wired up to the crypto API internally so the
optimal driver is selected based on the platform's capabilities,
crc32_le() is implemented as a library function using a slice-by-8 table
based C implementation. Even though few of the call sites may be
bottlenecks, calling a time variant implementation with a non-negligible
D-cache footprint is a bit of a waste, given that ARMv8.1 and up mandates
support for the CRC32 instructions that were optional in ARMv8.0, but are
already widely available, even on the Cortex-A53 based Raspberry Pi.
So implement routines that use these instructions if available, and fall
back to the existing generic routines otherwise. The selection is based
on alternatives patching.
Note that this unconditionally selects CONFIG_CRC32 as a builtin. Since
CRC32 is relied upon by core functionality such as CONFIG_OF_FLATTREE,
this just codifies the status quo.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
The current delay implementation uses the yield instruction, which is a
hint that it is beneficial to schedule another thread. As this is a hint,
it may be implemented as a NOP, causing all delays to be busy loops. This
is the case for many existing CPUs.
Taking advantage of the generic timer sending periodic events to all
cores, we can use WFE during delays to reduce power consumption. This is
beneficial only for delays longer than the period of the timer event
stream.
If the timer event stream is not enabled, delays will behave as
yield/busy loops.
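The resulting delay loop (arch/arm64/lib/delay.c, lightly abridged):
wait in WFE event-stream granularity while a whole period still fits,
then trim the remainder with the traditional busy loop:

	void __delay(unsigned long cycles)
	{
		cycles_t start = get_cycles();

		if (arch_timer_evtstrm_available()) {
			const cycles_t timer_evt_period =
				USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US);

			while ((get_cycles() - start + timer_evt_period)
			       < cycles)
				wfe();
		}

		while ((get_cycles() - start) < cycles)
			cpu_relax();
	}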
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
The arch timer configuration for a CPU might get reset after suspending
said CPU.
In order to reliably use the event stream in the kernel (e.g. for delays),
we keep track of the state where we can safely consider the event stream as
properly configured. After writing to cntkctl, we issue an ISB to ensure
that subsequent delay loops can rely on the event stream being enabled.
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>