Broken L2 cache reporting on crostini
With Google's recent announcement of support for running real Linux apps on Chrome OS, I picked up a Pixelbook, since I've been long awaiting the viability of Chromebooks as development machines.
After setting up a dev VM and experimenting with various projects, I found that one Tensorflow application I was playing with would lock up, hard, inside the Crostini VM on my Chromebook.
After adding some debug prints, I discovered that virtually any calls into numpy.linalg.inv were hanging. I could reproduce as simply as:
python3 -c 'import numpy as np; np.linalg.inv(np.identity(3))'
Googling found https://github.com/numpy/numpy/issues/11041, which was similar, but my bug was far worse, and the workaround on that issue didn't solve my problem. It did, however, point me at the OPENBLAS_NUM_THREADS=1 environment variable, which limits OpenBLAS to a single thread; this would prove helpful later.
I suspected a Crostini bug at this point, since this was pretty basic functionality to be broken, but I reported a numpy bug in the interim while I debugged. (The numpy maintainers ended up (correctly!) also guessing this was a Crostini bug, but were also able to provide some helpful debugging pointers.)
I stopped a hung process (with OPENBLAS_NUM_THREADS=1 to simplify the situation) in gdb and got a stack trace:
(gdb) bt
#0  0x00007ffff445a5b8 in dtrsm_oltucopy_PRESCOTT () from /home/nelhage/.local/lib/python3.5/site-packages/numpy/core/../.libs/libopenblasp-r0-39a31c03.2.18.so
#1  0x00007ffff426aad3 in dtrsm_LNLU () from /home/nelhage/.local/lib/python3.5/site-packages/numpy/core/../.libs/libopenblasp-r0-39a31c03.2.18.so
#2  0x00007ffff43b0a24 in dgetrs_N_single () from /home/nelhage/.local/lib/python3.5/site-packages/numpy/core/../.libs/libopenblasp-r0-39a31c03.2.18.so
#3  0x00007ffff4191965 in dgesv_ () from /home/nelhage/.local/lib/python3.5/site-packages/numpy/core/../.libs/libopenblasp-r0-39a31c03.2.18.so
#4  0x00007ffff1e5a103 in ?? () from /home/nelhage/.local/lib/python3.5/site-packages/numpy/linalg/_umath_linalg.cpython-35m-x86_64-linux-gnu.so
#5  0x00007ffff3aaed24 in ?? () from /home/nelhage/.local/lib/python3.5/site-packages/numpy/core/umath.cpython-35m-x86_64-linux-gnu.so
#6  0x00007ffff3aaf538 in ?? () from /home/nelhage/.local/lib/python3.5/site-packages/numpy/core/umath.cpython-35m-x86_64-linux-gnu.so
#7  0x00007ffff3ab0ddf in ?? () from /home/nelhage/.local/lib/python3.5/site-packages/numpy/core/umath.cpython-35m-x86_64-linux-gnu.so
#8  0x000055555575d647 in PyObject_Call ()
#9  0x00005555556d4ee1 in PyEval_EvalFrameEx ()
#10 0x00005555556d493f in PyEval_EvalFrameEx ()
#11 0x00005555556d9286 in ?? ()
#12 0x00005555556d9f9f in PyEval_EvalCode ()
#13 0x00005555556b89bf in PyRun_StringFlags ()
#14 0x00005555557a9f3c in PyRun_SimpleStringFlags ()
#15 0x00005555557d8602 in Py_Main ()
#16 0x0000555555668c01 in main ()
This confirmed we're hung in OpenBLAS, and in particular tells us that numpy ships its own OpenBLAS. Debug symbols would almost certainly help here, so I installed Debian's OpenBLAS, and the corresponding debug symbols:
$ sudo apt install libopenblas-base libopenblas-base-dbgsym
Now we can force load that version, and get better symbols:
$ env LD_PRELOAD=/usr/lib/libopenblasp-r0.2.19.so OPENBLAS_NUM_THREADS=1 gdb --args python3 -c 'import numpy as np; np.linalg.inv(np.identity(3))'
...
(gdb) bt
#0  dtrsm_oltucopy_PRESCOTT (m=3, n=0, a=<optimized out>, lda=3, offset=<optimized out>, b=<optimized out>) at generic/trsm_ltcopy_4.c:346
#1  0x00007ffff5e5e8b4 in dtrsm_LNLU (args=args@entry=0x7fffffffbc10, range_m=range_m@entry=0x0, range_n=range_n@entry=0x0, sa=sa@entry=0x7fffecaa5000, sb=sb@entry=0x7fffecaa5100, dummy=dummy@entry=0) at trsm_L.c:153
#2  0x00007ffff5fa7765 in dgetrs_N_single (args=args@entry=0x7fffffffbc10, range_m=range_m@entry=0x0, range_n=range_n@entry=0x0, sa=sa@entry=0x7fffecaa5000, sb=sb@entry=0x7fffecaa5100, mypos=mypos@entry=0) at getrs_single.c:51
#3  0x00007ffff5d7d978 in dgesv_ (N=<optimized out>, NRHS=0x7fffffffbdcc, a=<optimized out>, ldA=<optimized out>, ipiv=<optimized out>, b=<optimized out>, ldB=0x7fffffffbdd4, Info=0x7fffffffbda0) at lapack/gesv.c:127
#4  0x00007fffef583103 in ?? () from /home/nelhage/.local/lib/python3.5/site-packages/numpy/linalg/_umath_linalg.cpython-35m-x86_64-linux-gnu.so
#5  0x00007ffff11d7d24 in ?? () from /home/nelhage/.local/lib/python3.5/site-packages/numpy/core/umath.cpython-35m-x86_64-linux-gnu.so
#6  0x00007ffff11d8538 in ?? () from /home/nelhage/.local/lib/python3.5/site-packages/numpy/core/umath.cpython-35m-x86_64-linux-gnu.so
#7  0x00007ffff11d9ddf in ?? () from /home/nelhage/.local/lib/python3.5/site-packages/numpy/core/umath.cpython-35m-x86_64-linux-gnu.so
#8  0x000055555575d647 in PyObject_Call ()
#9  0x00005555556d4ee1 in PyEval_EvalFrameEx ()
#10 0x00005555556d493f in PyEval_EvalFrameEx ()
#11 0x00005555556d9286 in ?? ()
#12 0x00005555556d9f9f in PyEval_EvalCode ()
#13 0x00005555556b89bf in PyRun_StringFlags ()
#14 0x00005555557a9f3c in PyRun_SimpleStringFlags ()
#15 0x00005555557d8602 in Py_Main ()
#16 0x0000555555668c01 in main ()
Line numbers! Variable names! Oh my!
After taking a few stack traces to see where we're stuck, I became pretty convinced we were stuck in this loop:
for(is = ls + min_i; is < ls + min_l; is += GEMM_P){

#ifndef TRANSA
	TRSM_ILTCOPY(min_l, min_i, a + (is + ls * lda) * COMPSIZE, lda, is - ls, sa);
#else
	TRSM_IUNCOPY(min_l, min_i, a + (ls + is * lda) * COMPSIZE, lda, is - ls, sa);
#endif

	TRSM_KERNEL(min_i, min_j, min_l, dm1,
#ifdef COMPLEX
		    ZERO,
#endif
		    sa, sb, b + (is + js * ldb) * COMPSIZE, ldb, is - ls);
}
Looking at the assembly, we find the loop ends with:
0x00007ffff5e5e8f8 :	movslq 0x280(%rax),%r10
0x00007ffff5e5e8ff :	add    %r10,%rbp
0x00007ffff5e5e902 :	cmp    %r15,%rbp
0x00007ffff5e5e905 :	jl     0x7ffff5e5e870 <dtrsm_lnlu>
I set a breakpoint on 0x00007ffff5e5e8ff and inspected %r10, which I was pretty sure was the GEMM_P increment in the for loop above:
(gdb) b *0x00007ffff5e5e8ff
Breakpoint 1 at 0x7ffff5e5e8ff: file trsm_L.c, line 148.
(gdb) c
Continuing.

Breakpoint 1, 0x00007ffff5e5e8ff in dtrsm_LNLU (args=args@entry=0x7fffffffbc10, range_m=range_m@entry=0x0, range_n=range_n@entry=0x0, sa=sa@entry=0x7fffecaa5000, sb=sb@entry=0x7fffecaa5100, dummy=dummy@entry=0) at trsm_L.c:148
148	in trsm_L.c
(gdb) p $r10
$1 = 0
So we're looping forever because we're looping over something by an increment of 0. Hm. Now, where does that come from? We can ask the debugger if that address points to a symbol:
(gdb) x/lx (void*)$rax + 0x280
0x7ffff7dadf40 <gotoblas_prescott>:	0x00000000
so GEMM_P appears to be a macro(?) that expands into an offset after a symbol named gotoblas_PRESCOTT. Github suggests that symbol is a gotoblas_t: https://github.com/xianyi/OpenBLAS/blob/26e1cfb65314a5579cc74aa8d7d30660db3e9ee1/driver/others/dynamic.c#L58
gdb tells us that struct is in fact larger than 0x280 bytes, so we're pulling a field from inside. Scanning the fields (via ptype) we find several named things like gemm_p, and in fact find one at offset 0x280:
(gdb) p sizeof(gotoblas_t)
$3 = 3992
(gdb) ptype gotoblas_t
type = struct {
    ....
}
(gdb) p &((gotoblas_t*)0)->dgemm_p
$4 = (int *) 0x280
I tried messing with watchpoints to catch the initialization of dgemm_p, but to no avail. Code search, however, revealed a whole bunch of assignments in a nested maze of #ifdef conditionals, all in one function in setparam-ref.c (I would later learn that OpenBLAS compiles this file once per architecture with a different mix of #defines, but that's mostly just fun trivia...).
We note that it's pulling the size, in most of these branches, based on l2, which is the result of:
int l2 = get_l2_size();
Is it possible the custom hypervisor used by crostini is incorrectly reporting a size-0 L2 cache? We read get_l2_size; the crux of it is:
cpuid(0x80000006, &eax, &ebx, &ecx, &edx);
Some quick googling confirms that cpuid query 0x80000006 returns information about the L2 cache. With the help of the cpuid(1) command-line tool, we can see what our virtual CPU returns:
$ cpuid -1 -r -l 0x80000006
CPU:
   0x80000006 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
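The L2 size OpenBLAS wants comes out of the ECX register of that leaf (bits 31:16 are the size in KB). Roughly, the decoding looks like this (a Python sketch, not the actual OpenBLAS C, plugging in the value dumped above):

ecx = 0x00000000            # ECX from CPUID leaf 0x80000006 inside the VM

l2_size_kb = (ecx >> 16) & 0xffff   # bits 31:16: L2 cache size in KB
line_size  = ecx & 0xff             # bits  7:0: cache line size in bytes
print(l2_size_kb, line_size)        # 0 0 -- OpenBLAS sees a zero-size L2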
All zeros! So, the hypervisor that manages the VMs on Google's new Crostini environment is failing to configure the CPUID values for L2 cache size, resulting in OpenBLAS seeing a 0-size L2 cache, and looping forever as it tries to loop over data in L2-sized chunks!
I filed two bugs, one against Google about the cpuid issue, and one against OpenBLAS asking for greater robustness in this particular edge case. The latter is already fixed, and the former has been confirmed and will hopefully be resolved soon.
Experience report: Trying to map over a binary tree in Rust
This story is simplified from an attempt to write an AST walker for a toy compiler, but the essential facts are unchanged.
I am fairly new to Rust (though an experienced C++ programmer) and have been trying to pick it up recently. This evening I spent a miserable hour trying to write a function to map an FnMut over a binary tree. I started with a simple tree definition, and tried to write what seemed to me to be the straightforward code:
use std::rc::Rc;

#[derive(Debug)]
enum Tree {
    Leaf(i32),
    Node(Rc<Tree>, Rc<Tree>),
}

fn map_tree<F>(tree: &Rc<Tree>, f: F) -> Rc<Tree>
where
    F: FnMut(Rc<Tree>) -> Rc<Tree>,
{
    match &**tree {
        &Tree::Leaf(_) => f(Rc::clone(tree)),
        &Tree::Node(ref l, ref r) => Rc::new(Tree::Node(map_tree(&l, f), map_tree(&r, f))),
    }
}

pub fn main() {
    let tree = Rc::new(
        Tree::Node(Rc::new(Tree::Leaf(0)), Rc::new(Tree::Leaf(1))),
    );
    let mut sum = 0;
    map_tree(&tree, |node| {
        match *node {
            Tree::Leaf(i) => sum = sum + i,
            _ => unreachable!()
        }
        node
    });
    print!("sum={}", sum);
}
(playground link)
I've got a vague premonition that the borrow-checker will yell at me for using f twice in that latter clause, but hey, let's see what it says. It says:
error[E0596]: cannot borrow immutable argument `f` as mutable
  --> src/main.rs:14:27
   |
9  | fn map_tree<F>(tree: &Rc<Tree>, f: F) -> Rc<Tree>
   |                                 - consider changing this to `mut f`
...
14 |         &Tree::Leaf(_) => f(Rc::clone(tree)),
   |                           ^ cannot borrow mutably

error[E0382]: use of moved value: `f`
  --> src/main.rs:15:87
   |
15 |         &Tree::Node(ref l, ref r) => Rc::new(Tree::Node(map_tree(&l, f), map_tree(&r, f))),
   |                                                                       -                 ^ value used here after move
   |                                                                       |
   |                                                                       value moved here
   |
   = note: move occurs because `f` has type `F`, which does not implement the `Copy` trait
Well, I was right. Let's start by following the first piece of advice, from the E0596:
fn map_tree<F>(tree: &Rc<Tree>, mut f: F) -> Rc<Tree>
where
    F: FnMut(Rc<Tree>) -> Rc<Tree>,
{
    match &**tree {
        &Tree::Leaf(_) => f(Rc::clone(tree)),
        &Tree::Node(ref l, ref r) => Rc::new(Tree::Node(map_tree(&l, f), map_tree(&r, f))),
    }
}
(playground link)
First error is resolved, but the second remains unchanged:
error[E0382]: use of moved value: `f`
  --> src/main.rs:15:87
   |
15 |         &Tree::Node(ref l, ref r) => Rc::new(Tree::Node(map_tree(&l, f), map_tree(&r, f))),
   |                                                                       -                 ^ value used here after move
   |                                                                       |
   |                                                                       value moved here
   |
   = note: move occurs because `f` has type `F`, which does not implement the `Copy` trait
Well, I've started to pick up Rust; I know what to do when I'm using a value temporarily and then want to use it again: I probably need more borrows. Because I'm new to this, I forget to borrow mutably:
fn map_tree<F>(tree: &Rc<Tree>, mut f: F) -> Rc<Tree>
where
    F: FnMut(Rc<Tree>) -> Rc<Tree>,
{
    match &**tree {
        &Tree::Leaf(_) => f(Rc::clone(tree)),
        &Tree::Node(ref l, ref r) => Rc::new(Tree::Node(map_tree(&l, &f), map_tree(&r, &f))),
    }
}
(playground link)
This makes that error go away, but replaces it with a different -- very confusing -- spew:
error[E0277]: the trait bound `F: std::ops::Fn<(std::rc::Rc<Tree>,)>` is not satisfied
  --> src/main.rs:15:57
   |
15 |         &Tree::Node(ref l, ref r) => Rc::new(Tree::Node(map_tree(&l, &f), map_tree(&r, &f))),
   |                                                         ^^^^^^^^ the trait `std::ops::Fn<(std::rc::Rc<Tree>,)>` is not implemented for `F`
   |
   = help: consider adding a `where F: std::ops::Fn<(std::rc::Rc<Tree>,)>` bound
   = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&F`
note: required by `map_tree`
  --> src/main.rs:9:1
   |
9  | / fn map_tree<F>(tree: &Rc<Tree>, mut f: F) -> Rc<Tree>
10 | | where
11 | |     F: FnMut(Rc<Tree>) -> Rc<Tree>,
12 | | {
...  |
16 | |     }
17 | | }
   | |_^
Why is std::ops::Fn suddenly appearing? I said FnMut. What is going on?
I eventually google around and realize I need more muts:
fn map_tree<F>(tree: &Rc<Tree>, mut f: F) -> Rc<Tree>
where
    F: FnMut(Rc<Tree>) -> Rc<Tree>,
{
    match &**tree {
        &Tree::Leaf(_) => f(Rc::clone(tree)),
        &Tree::Node(ref l, ref r) => Rc::new(Tree::Node(map_tree(&l, &mut f), map_tree(&r, &mut f))),
    }
}
(playground link)
This now compiles the implementation, but blows up with a recursion limit at the call site!
error[E0275]: overflow evaluating the requirement `[closure@src/main.rs:22:21: 28:6 sum:&mut i32]: std::ops::FnMut<(std::rc::Rc<Tree>,)>` | = help: consider adding a `#![recursion_limit="128"]` attribute to your crate = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut 
[closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut 
[closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut [closure@src/main.rs:22:21: 28:6 sum:&mut i32]` ...
At this point I'm pretty stuck and pretty frustrated.
But I need to break the recursion somehow, so maybe I need to also make map_tree take a reference so the inner and outer calls have the same type? I've not yet mastered Rust syntax, but I know &mut is a thing, and I've already got a mut in the signature, so let's stick an & on it:
fn map_tree<F>(tree: &Rc<Tree>, &mut f: F) -> Rc<Tree>
where
    F: FnMut(Rc<Tree>) -> Rc<Tree>,
{
    match &**tree {
        &Tree::Leaf(_) => f(Rc::clone(tree)),
        &Tree::Node(ref l, ref r) => Rc::new(Tree::Node(map_tree(&l, &mut f), map_tree(&r, &mut f))),
    }
}
(playground link)
New error!
error[E0308]: mismatched types
 --> src/main.rs:9:33
  |
9 | fn map_tree<F>(tree: &Rc<Tree>, &mut f: F) -> Rc<Tree>
  |                                 ^^^^^^ expected type parameter, found &mut _
  |
  = note: expected type `F`
             found type `&mut _`
  = help: did you mean `mut f: &F`?
Well, there's a helpful "did you mean"; let's try that:
fn map_tree<F>(tree: &Rc<Tree>, mut f: &F) -> Rc<Tree>
where
    F: FnMut(Rc<Tree>) -> Rc<Tree>,
{
    match &**tree {
        &Tree::Leaf(_) => f(Rc::clone(tree)),
        &Tree::Node(ref l, ref r) => Rc::new(Tree::Node(map_tree(&l, &mut f), map_tree(&r, &mut f))),
    }
}
(playground link)
No luck, and we're back to complaining about Fn, which I still haven't mentioned anywhere:
error[E0277]: the trait bound `F: std::ops::Fn<(std::rc::Rc<Tree>,)>` is not satisfied
  --> src/main.rs:15:57
   |
15 |         &Tree::Node(ref l, ref r) => Rc::new(Tree::Node(map_tree(&l, &mut f), map_tree(&r, &mut f))),
   |                                                         ^^^^^^^^ the trait `std::ops::Fn<(std::rc::Rc<Tree>,)>` is not implemented for `F`
   |
   = help: consider adding a `where F: std::ops::Fn<(std::rc::Rc<Tree>,)>` bound
   = note: required because of the requirements on the impl of `std::ops::FnMut<(std::rc::Rc<Tree>,)>` for `&F`
note: required by `map_tree`
  --> src/main.rs:9:1
   |
9  | / fn map_tree<F>(tree: &Rc<Tree>, mut f: &F) -> Rc<Tree>
10 | | where
11 | |     F: FnMut(Rc<Tree>) -> Rc<Tree>,
12 | | {
...  |
16 | |     }
17 | | }
   | |_^
At this point I gave up and asked a Rustacean I knew on IRC, who pointed out that I was so close, and yet so far:
fn map_tree<F>(tree: &Rc<Tree>, f: &mut F) -> Rc<Tree>
where
    F: FnMut(Rc<Tree>) -> Rc<Tree>,
{
    match &**tree {
        &Tree::Leaf(_) => f(Rc::clone(tree)),
        &Tree::Node(ref l, ref r) => Rc::new(Tree::Node(map_tree(&l, f), map_tree(&r, f))),
    }
}
(playground link)
I can, with the benefit of hindsight, more-or-less understand what happened and the error messages at every step of the way. (The key, in hindsight: with f: &mut F, each recursive call reborrows the same &mut F, so map_tree is instantiated at a single type; passing &mut f into a by-value F parameter instead instantiates map_tree at a new type at every level of recursion (&mut F, &mut &mut F, and so on), which is where the tower of &muts came from.) However, the experience of getting there was really frustrating, notably because at no point did the "did you mean" suggestions actually lead me particularly closer to my goal, and because I had to go by way of a compiler recursion overflow(!) before I could reach a working program.
I want to be really excited about Rust; we need something that can replace C++'s capacity for powerful zero-cost abstractions and native compilation, while being less ridden with awful memory safety bugs. So far, though, this interaction is honestly about the median of my personal experience trying to write Rust. I'm sure it will get better with time, but right now I'm grumpy.
When observing your database stalls it
Recently, on my other blog accidentallyquadratic, I documented a case of accidentally quadratic behavior in /proc/$pid/maps on a wide range of recent Linux kernels.
While this bug is amusing, it might initially not seem that important; /proc/$pid/maps is primarily a debugging or inspection tool, and while 30s access times aren’t pleasant, they probably aren’t breaking anything too critical.
Today I want to explore, by way of some microbenchmarks, the more pernicious impact of that bug.
I wrote a microbenchmark that’s designed to simulate the load of a network server (perhaps a database of some sort) that relies on mmap to access a large file on disk. The benchmark allocates and maps a large file, and then spawns three sets of threads (a rough sketch of its structure follows the list):
A large number of idle threads, simulating idle connections
A number of “reader” threads, which repeatedly read from random memory pages of the mapped region. These would simulate threads servicing requests.
A number of “mapper” threads, which repeatedly mmap and munmap small regions in memory. These might simulate background threads, or connection churn, which might mmap and munmap thread stacks on a real service.
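The benchmark itself isn’t reproduced here; purely as a sketch of its structure (the file name is made up, and a real run of this wants native code rather than Python), the shape is roughly:

import mmap, os, random, threading, time

SIZE = 1 << 30  # stand-in; the real benchmark mapped a 200GB file

fd = os.open("bench.dat", os.O_CREAT | os.O_RDWR)
os.ftruncate(fd, SIZE)
buf = mmap.mmap(fd, SIZE)

def idler():     # idle "connections"
    while True:
        time.sleep(3600)

def reader():    # request-serving threads: fault in random pages
    while True:
        off = random.randrange(0, SIZE, mmap.PAGESIZE)
        _ = buf[off]

def mapper():    # churn: repeatedly mmap and munmap small regions
    while True:
        m = mmap.mmap(-1, 16 * mmap.PAGESIZE)
        m.close()

for fn, count in ((idler, 10000), (reader, 10), (mapper, 1)):
    for _ in range(count):
        threading.Thread(target=fn, daemon=True).start()
time.sleep(120)  # run while read latencies are measured externally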
Unless stated otherwise, I ran all the following experiments on a 200GB mapping on an i2.2xlarge ec2 instance running Ubuntu Trusty on a 3.13 kernel.
We start with a baseline of 10,000 idle threads, 10 readers, and no mappers:
(blue is p90, red is p99, measured over batches of 1000 consecutive reads in a single thread)
First observation: we can invoke a page fault handler, read a page from SSD, update the page tables, and get back to userspace, in ~500µs at p99, in 10 concurrent threads. Computers are fast.
Now, in that previous chart, at t=20s, I actually ran cat /proc/$pid/maps > /dev/null in a separate shell. You can see that it had minimal if any effect, as you might hope (it ate 10s or so of CPU time, but disk speed wasn’t impacted).
However, now let’s re-run with an identical configuration, plus a single thread running mmap and munmap in a tight loop:
Once again, at t=20s I’ve run a cat /proc/$pid/maps > /dev/null.
Whoa! It’s lost in the scale of the graph, but the baseline performance is pretty similar to the previous case. However, inspecting the maps file spikes read latencies to upwards of 100ms!
What’s Happening?
On Linux, the address space of a process (the description of which regions of memory are backed by what – anonymous mappings, file mmaps, etc) is protected by a shared reader-writer lock. All threads in a single process share an address space, and thus share the corresponding reader-writer lock (called mmap_sem in the kernel).
Handling a page fault requires a read lock on mmap_sem: A page fault has to look up the memory region containing the fault, but it doesn’t change the virtual memory layout.
Handling /proc/$pid/maps has to walk the virtual memory layout, so it also requires a read lock. Thus, it can proceed in parallel with page faults.
Handling mmap mutates the memory layout, so it requires a write lock. It only needs to hold the lock long enough to insert an entry into a red-black tree, which is ~instantaneous compared to going out to disk, and thus doesn’t affect performance much by itself.
However, the Linux rw_semaphore is writer-priority, and it is when we add all of these ingredients together that we get trouble.
The /proc/$pid/maps call holds the mmap_sem for reading while it’s busy doing quadratic work looking up all those stacks. If, while it’s running, a writer attempts to enter, that writer will have to wait for it to complete.
But, because the lock is writer-priority, this blocks any new readers from entering. Thus, in the presence of a /proc/$pid/maps reader and an mmap thread, we find all readers now blocking behind the slow /proc/$pid/maps read, with disastrous results.
(Why do we see a bunch of writes get through, instead of a complete stoppage during the read? Files in /proc/ tend to only hold the lock while they write a batch of entries to userspace, and then drop and reacquire it. This behavior is necessary to avoid holding locks forever while waiting on a slow userspace reader)
nsd+kernel 3.2 memory leak
This one is a little boring in that it's not a new bug, but tracking it down was still real exciting.
A while back, Stripe started experiencing some serious intermittent sadness with our internal DNS servers. DNS queries would time out or fail to return, and our DNS servers would periodically OOM-kill processes, despite no application appearing to use much memory.
This incident was during the era of our consul battles, so we suspected consul, but were unable to prove its complicity in any way.
Finally I happened to take a hard look at the OOM-killer spew from the kernel and noticed something odd:
[10726504.368573] Node 0 DMA32 free:47980kB min:4176kB low:5220kB high:6264kB active_anon:49664kB inactive_anon:10648kB active_file:128kB inactive_file:12kB unevictable:0kB isolated(anon):204kB isolated(file):0kB present:4112640kB mlocked:0kB dirty:0kB writeback:10476kB mapped:72kB shmem:40kB slab_reclaimable:1036kB slab_unreclaimable:3935056kB kernel_stack:384kB pagetables:1208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:107180 all_unreclaimable? yes
[10726504.368583] Node 0 Normal free:7848kB min:11548kB low:14432kB high:17320kB active_anon:244132kB inactive_anon:24816kB active_file:36kB inactive_file:0kB unevictable:0kB isolated(anon):384kB isolated(file):240kB present:11362176kB mlocked:0kB dirty:0kB writeback:24828kB mapped:1912kB shmem:2356kB slab_reclaimable:18140kB slab_unreclaimable:10799296kB kernel_stack:2056kB pagetables:9696kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:409272 all_unreclaimable? yes
There's a looooot of spew in there but if we just look for the largest numbers we spot
slab_unreclaimable:3935056kB
slab_unreclaimable:10799296kB
The “DMA32” zone refers to all memory below the 4G boundary on a 64-bit kernel, so that first number (3.75GiB) is most of that, and the total is awfully close to the 14GB of total RAM in the box.
In the Linux kernel “slab” refers to the kernel's slab allocator, used for internal allocations by the kernel itself. So essentially all of the box's memory is being used by the Linux Kernel itself, not user applications, which explains why it's swapping itself to death and OOM-killing even though no application is using much memory.
Logging into a not-yet-dead DNS server, I was able to confirm this and also get more information by looking at /proc/slabinfo:
root@dns:~# perl -lane 'print $F[2]*$F[3], " ", $F[0]' /proc/slabinfo | sort -n | tail
2825504 vm_area_struct
3117056 kmalloc-256
3231360 ext4_inode_cache
3777200 radix_tree_node
4060000 inode_cache
5475456 dentry
7548928 kmalloc-128
15599376 nf_conntrack_ffffffff81c9d500
4165951104 anon_vma
5672069040 shared_policy_node
(slabinfo has one entry for each slab allocator, each allocator being used for a single type of object. It contains a lot of fields – documented in the first line of the file – but the first few are
# name            <active_objs> <num_objs> <objsize>
so by multiplying “num_objs” by “objsize” we get a total memory usage. I later learned about slabtop(1), which is essentially a more sophisticated version of that perl invocation)
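A rough Python equivalent of that perl one-liner, for reference (not the command I actually ran):

# Total bytes per slab cache: num_objs * objsize, largest ten last.
rows = []
with open("/proc/slabinfo") as f:
    for line in f:
        fields = line.split()
        if not fields or fields[0] in ("slabinfo", "#"):
            continue
        rows.append((int(fields[2]) * int(fields[3]), fields[0]))
for total, name in sorted(rows)[-10:]:
    print(total, name)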
So nearly all of the memory usage is coming from two slabs in particular, the “anon_vma” objects and “shared_policy_node”. “anon_vma” is the name for a bookkeeping structure for “anonymous” mappings, i.e. those not backed by a file (e.g. created by brk or anonymous mmap), and some googling told me that “shared_policy_node” is some kind of NUMA bookkeeping data. So there's clearly something up with kernel memory management, but that's not super useful.
It did however enable me to make more directed experiments: Stopping or restarting the consul agent on the box did not meaningfully change the slab allocations, suggesting consul might not be the culprit.
Eventually, the right combination of googling around "anon_vma" and "leak" or similar brought me to https://lkml.org/lkml/2012/8/15/765, which sounded exactly like our issue. Furthermore, as part of our debugging, I'd learned that nsd, our DNS server, used a similar fork dance to do graceful reloads.
Armed with that knowledge, I was able to confirm that doing a complete restart (as opposed to graceful reload) of nsd caused the memory to be released, and that doing a thousand graceful reloads caused those numbers to grow rapidly.
The precipitating change, therefore, turned out to be a change we'd made recently to automatically re-populate zone files from consul data and reload NSD; by automatically reloading NSD at a high frequency, it allowed this leak to build up to the point of becoming critical.
Once we'd found it, it was also easy to notice http://www.nlnetlabs.nl/blog/category/nsd/nsd4-nsd/, which described upstream nsd identifying this bug and working around it.
We fixed it by upgrading our DNS servers to Ubuntu Trusty (which we'd been meaning to do anyways), which both fixed the bug-triggering behavior in a newer version of NSD, and upgraded us to a new kernel without the memory leak.
Broken POSIX ACLs on NFS on Kernel 3.16
We use Vagrant for development at Stripe, using NFS mounts to share code from the host into the Vagrant dev box.
We use bundler to manage Ruby dependencies, and configure it to install gems directly into the project directory, inside the vendor/ subdirectory. Installing gems there, instead of globally, ensures isolation, and preserves gems across re-creation of the Vagrant VM, which means you don't need to wait for a bunch of gems to download if you blow away your VM.
Recently we upgraded our Vagrant image to one based off of the just-released Ubuntu 14.04.2, and users reported permission errors installing gems with bundle install. The errors looked something like:
Gem::Ext::BuildError: ERROR: Failed to build gem native extension.
…
make "DESTDIR=" install
make: Warning: File `Makefile' has modification time 2.7e+03 s in the future
/usr/bin/install -c -m 0755 icunicode.so ./.gem.20150417-8657-1bb5dt1
/usr/bin/install: setting permissions for ‘./.gem.20150417-8657-1bb5dt1/icunicode.so’: Operation not permitted
make: *** [install-so] Error 1

make install failed, exit code 2

Gem files will remain installed in vendor/bundle/bundler/gems/icunicode-87ce99507758 for inspection.
Results logged to vendor/bundle/bundler/gems/extensions/x86_64-linux/2.1.0/icunicode-87ce99507758/gem_make.out

An error occurred while installing icunicode (0.1.4), and Bundler cannot continue.
Make sure that `gem install icunicode -v '0.1.4'` succeeds before bundling.
Notably the error was
/usr/bin/install -c -m 0755 icunicode.so ./.gem.20150417-8657-1bb5dt1
/usr/bin/install: setting permissions for ‘./.gem.20150417-8657-1bb5dt1/icunicode.so’: Operation not permitted
As is my habit for debugging bizarre permission errors, I immediately turned to strace to try to understand what was happening:
$ strace -fo /tmp/bundle.strace bundle install
This produced 150k lines of strace output, but searching for EPERM quickly showed the real error:
8901  open("./.gem.20150417-8786-11yufdb/thrift_native.so", O_WRONLY|O_CREAT|O_EXCL, 0600 <unfinished ...>
8901  <... open resumed> )              = 4
…
8901  fsetxattr(4, "system.posix_acl_access", "\x02\x00\x00\x00\x01\x00\x06\x00\xff\xff\xff\xff\x04\x00\x00\x00\xff\xff\xff\xff \x00\x00\x00\xff\xff\xff\xff", 28, 0) = -1 EPERM (Operation not permitted)
Oh boy. fsetxattr is used to manipulate POSIX extended attributes, and system.posix_acl_access specifically means it's trying to set a POSIX ACL. We don't use POSIX ACLs, I have no idea if OS X's NFS server even supports them, and so whatever is going on is just silly. But somehow I need to make that call succeed in order for this to work.
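You can poke the same syscall by hand, too. A quick sketch in Python 3, replaying the exact ACL blob from the strace against a scratch file (the path is hypothetical; any file you own on the NFS mount works):

import os

# The 28-byte system.posix_acl_access value from the strace above (the literal
# space in the middle of the strace output is the byte 0x20):
acl = bytes.fromhex(
    "02000000"            # ACL version 2
    "01000600ffffffff"    # ACL_USER_OBJ,  rw-
    "04000000ffffffff"    # ACL_GROUP_OBJ, ---
    "20000000ffffffff"    # ACL_OTHER,     ---
)

path = "vendor/scratchfile"   # hypothetical file on the NFS mount
open(path, "w").close()
os.setxattr(path, "system.posix_acl_access", acl)  # EPERM on the broken kernel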
Since I'm not using POSIX ACLs at all, my first try is to disable them by mounting the NFS mount with noacl, which should just turn off extended ACLs on the mount, ideally making the syscall fail with EOPNOTSUPP, which install will identify as successful in this case. Unfortunately, that produces no change – the call still fails identically.
At this point, I resort to a bunch of frantic googling about various combinations of "nfs", "vagrant", "POSIX ACLs", and so on, to no avail.
By now, this has become a Mystery, so I get curious and start digging in earnest. The error is coming from inside the kernel, so it's time to dive into the kernel. There are a number of options here, but I decided to reach for ftrace, the kernel function tracer, which, among other nice properties, works out of the box on recent kernels.
ftrace will allow me to get a trace of every function call made inside the kernel during the execution of some program. For mysterious error codes from syscalls, I find this usually suffices to point to the problem: The error usually comes from the last thing to happen before the syscall returns, and knowing exactly which code is returning the error usually suffices to figure out what the error is.
I confirm that a simple install invocation on the NFS mount is sufficient to reproduce the problem, which simplifies testing:
$ install Rakefile vendor/
install: setting permissions for ‘vendor/Rakefile’: Operation not permitted
I install trace-cmd, and record a trace:
$ sudo trace-cmd record -p function_graph install Rakefile vendor
  plugin 'function_graph'
Kernel buffer statistics:
  Note: "entries" are the entries left in the kernel ring buffer and are not
        recorded in the trace data. They should all be zero.

CPU: 0
entries: 0
overrun: 70180
commit overrun: 0
bytes: 3220
oldest event ts: 12293.045060
now ts: 12293.335890
dropped events: 0
read events: 218340

CPU: 1
entries: 0
overrun: 2175
commit overrun: 0
bytes: 1288
oldest event ts: 12293.045078
now ts: 12293.335951
dropped events: 0
read events: 179846

CPU0 data recorded at offset=0x40b000
    6168576 bytes in size
CPU1 data recorded at offset=0x9ed000
    5083136 bytes in size
At this point, I notice something odd: The command succeeded! I speculate that this is because I'm running it as root (which I had to in order to access ftrace), which is easy to confirm:
$ sudo install Rakefile vendor/
$
Huh. OK. That's not totally surprising for a permission error, although with NFS in the mix, who even knows. But in any case, we're gonna need to trace the command as my normal user. This is easily done, if slightly more involved:
$ bash -c 'sudo trace-cmd start -p function_graph -P $$; exec install Rakefile vendor'; sudo trace-cmd stop
  plugin 'function_graph'
install: setting permissions for ‘vendor/Rakefile’: Operation not permitted
We start a subshell, start tracing just that PID, and then exec to overlay the install command into the same PID. We can then stop the trace and view it with sudo trace-cmd extract && sudo trace-cmd report | less. Searching that output for setxattr, we find the call chain:
              |  sys_fsetxattr() {
              |    __fdget() {
   0.039 us   |      __fget_light();
   0.266 us   |    }
              |    mnt_want_write_file() {
   0.058 us   |      __sb_start_write();
              |      __mnt_want_write_file() {
   0.038 us   |        mnt_clone_write();
   0.365 us   |      }
   0.860 us   |    }
              |    setxattr() {
              |      __kmalloc() {
   0.366 us   |        kmalloc_slab();
   0.024 us   |        _cond_resched();
   0.930 us   |      }
   0.053 us   |      posix_acl_fix_xattr_from_user();
              |      vfs_setxattr() {
   0.154 us   |        xattr_permission();
              |        mutex_lock() {
   0.025 us   |          _cond_resched();
   0.259 us   |        }
              |        security_inode_setxattr() {
   0.064 us   |          cap_inode_setxattr();
   0.189 us   |          ima_inode_setxattr();
              |          evm_inode_setxattr() {
              |            evm_protect_xattr.isra.2() {
   0.691 us   |              evm_protected_xattr();
   0.088 us   |              posix_xattr_acl();
   0.107 us   |              evm_verify_current_integrity();
   1.766 us   |            }
   2.158 us   |          }
   3.305 us   |        }
              |        __vfs_setxattr_noperm() {
              |          generic_setxattr() {
   0.111 us   |            xattr_resolve_name();
              |            posix_acl_xattr_set() {
              |              inode_owner_or_capable() {
              |                ns_capable() {
              |                  security_capable() {
              |                    apparmor_capable() {
   0.037 us   |                      cap_capable();
   0.348 us   |                    }
   0.753 us   |                  }
   1.003 us   |                }
   1.482 us   |              }
   1.784 us   |            }
   2.735 us   |          }
   3.018 us   |        }
   0.032 us   |        mutex_unlock();
   8.223 us   |      }
   0.168 us   |      kfree();
 + 10.707 us  |    }
The most interesting thing to note here is that we never even hit NFS code at all! The permission error is coming entirely from the generic POSIX ACL code.
It appears that what is going on is that the kernel tries to permission-check the setxattr by checking inode_owner_or_capable: The current user needs to either own the file, or have the CAP_FOWNER capability.
I happened to know from past experience that Vagrant relies on the mapall option in the host exports file, which translates user IDs on the server. The nelhage user on my host laptop has UID 501, and the vagrant user inside the VM is uid 1000. In order to allow this to work, Vagrant configures the NFS server with -mapall=501, which causes the NFS server to ignore UIDs from the client and perform all operations as UID 501.
However, this mapping is entirely server-side; On the VM, the kernel sees files owned by uid 501, and sees uid 1000 accessing them. This is totally fine, since the server translates things and is the ultimate arbiter of access, but it means that the generic inode_owner_or_capable is inappropriate here, and fails out. If the code had just passed the call directly into the NFS code, it would have failed with EOPNOTSUPP, which install will correctly just ignore.
I haven't fixed this or filed a bug with the kernel yet, although I did file an Ubuntu bug. Once we realized it was a kernel bug, it became clear that the behavior broke because we had upgraded to the 3.16 "hardware enablement kernel", and we downgraded our development boxes to 3.13, which fixes the issue for now: the commit which ported NFS to the generic layer landed in 3.14.
Writing a Brainfuck Quine
I was drinking with some coworkers and mentioned I'd never written a brainfuck quine. So, of course, as soon as we got back to computers, they started timing me.
It took me just over 30 minutes to produce this incredibly-verbose quine. I figured I'd do a quick writeup of how it works.
If you're not familiar with Brainfuck, you can get a quick refresher here. It's incredibly simple, so I won't go into detail here.
Most simple quines can be divided into two parts, which I'll refer to as the code and the data (obviously they're all code in the sense that they appear as part of the source, but they're conceptually very different).
The data is just a representation of the source code for the "code", in some machine-readable way. The job of the code is then to print out the data twice: The first time, in the same form the data was formatted in the input, and the second time, to use the data to print out the source code of the code itself. We can see this structure in the classic English "quine":
Print the following sentence, followed by its quotation: "Print the following sentence, followed by its quotation."
The second sentence, the quoted one, is the "data" -- it is not interpreted by the person following the instructions, just copied around. The first sentence is the "code," and contains the key two parts: "print the following sentence" prints the code, and "followed by its quotation" prints a representation of the data.
So, in order to write a brainfuck quine, we're going to represent brainfuck code as data generated by other brainfuck code, in such a way that:
We can use that data to print out the brainfuck code that generated the data.
We can use that data to print out the brainfuck code described by that data.
I've posted an annotated version of the source code, to help following along as we walk through the code.
The easiest (if awfully verbose) representation is trivial: We'll represent a character of brainfuck as a cell containing its ASCII value, and we'll generate that cell by N +s in a row.
In order to leave room for scratch computation, I decided to represent each value by N followed by three empty (0) cells. This turned out to be more than I needed, but was convenient.
So, in order to represent, say, a + in the code, we'll emit this:
+++++++++++++++++++++++++++++++++++++++++++>>>>
We'll also start with an extra >>>> to make it easier to find the start of the tape again.
Now we need to write the actual "code" which is the meat of the program. First it needs to walk over the data, and emit the same code that we wrote to generate the data.
Before we can do anything, skip back to the start of the tape:
<<<<[<<<<]
(The final >>>> will have skipped us off the end, so we back up once to find the last character, then repeat until we find the sentinel zeros we left at the start).
We need to emit the leading >>>>:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++....[-]
We generate the > character directly using +, emit it four times, and then clear the cell with [-] for good hygiene.
Now we enter a loop, printing out the ++++⋅⋅⋅>>> sequence for each byte. For each cell, we generate a spare + (in order to print), and then loop, printing that + an appropriate number of times. Because looping involves counting down, we also have to copy the initial byte to one of the scratch cells. I copied it to two of them for good measure, which ended up being unnecessary:
Generate a plus in the next cell over:
>+++++++++++++++++++++++++++++++++++++++++++<
Write out N pluses and copy the initial N twice:
[>.>+>+<<<-]>
Turn the plus into a greater-than and print four of those:
+++++++++++++++++++....
Now move to the next byte:
>>>
The end result of this is that we start with a tape that looks like
[<N> 0 0 0]
print out N +'s and four >s, and leave the tape as:
[0 '>' N N]
with the head on the next value of N. We wrap this in a loop, and we've now output the data, as it originally appeared.
We now reset back to the start of the tape again:
<[<<<<]
And walk over it once more, this time directly printing out the value at each cell:
>>>[.>>>>]
Now that we've written the code, we need to encode it in the appropriate format. This is pretty quick in a Python repl:
>>> import re
>>> s = open('in.bf').read()
>>> t = re.sub(r'[^]+><.[-]', '', s)
>>> print '>>>>'.join(['+' * ord(c) for c in t])
We can now concatenate these files, and run the result. It doesn't quite match the input, because the input contained additional whitespace (we wanted it to be at least vaguely human-readable!), but if we pass the output through a brainfuck interpreter again, we observe that it is unchanged:
[nelhage@anarchique:~/nelhage/bf]$ gobf test.bf > out.bf
[nelhage@anarchique:~/nelhage/bf]$ cmp test.bf out.bf ; echo $?
test.bf out.bf differ: byte 5, line 1
1
[nelhage@anarchique:~/nelhage/bf]$ gobf out.bf > out1.bf
[nelhage@anarchique:~/nelhage/bf]$ cmp out1.bf out.bf ; echo $?
0
Voilà!
Things I learned writing a JIT in Go
I was at Gophercon last week, and the last day, Saturday, was a hack day where people all sat down and just worked on projects, largely in Go. I decided on a whim to play with doing runtime code generation in Go. I’ve done some toy JIT work before in C and C++, so I’m pretty familiar with the space, and it seemed like something fun I hadn’t heard anyone playing with in Go.
After a few days of hacking, I produced some working code, complete with a PoC brainfuck JIT, entirely written in Go. I figured I’d write up some of the many things I learned in the process.
Go/plan9’s assembler is weird
Go has its own assembler, inherited from Plan 9. Here’s a sample, defining a function to add two numbers:
// add(x,y) -> x+y
TEXT ·add(SB),0,$0-24
	MOVQ x+0(FP), AX
	ADDQ y+8(FP), AX
	MOVQ AX, rv+16(FP)
	RET
Go assembly is not translated directly to x86 machine code, as you may be used to. Instead, the intermediate form between the compiler and the linker (or between the assembler and the linker) is an “incompletely defined instruction set” which the linker then does instruction selection and code generation over. So, while many instructions in a go assembly file will map directly to x86 opcodes, many others are pseudo-instructions, that may get turned into one of several concrete instruction sequences at link-time.
The linker also does more aggressive code transformations. For example, Go’s variable-size and/or split stacks are implemented nearly entirely in the linker. The $0-24 above says that this function uses 0 bytes of stack, but has 24 bytes of arguments (which live on the caller’s frame). The linker takes this information and inserts stack-expansion preambles as needed. Because the linker sees the whole program, it can even do interprocedural analysis to explore the stack sizes needed by entire leaf function call chains and optimize accordingly.
Notice also that the function being defined is ·add. Go symbols end up in the resulting objects using their fully-qualified names, e.g. github.com/nelhage/gojit.Alloc. However, since / and . are punctuation characters in the C and assembly syntax, they’ve extended the tools to accept U+00B7 MIDDLE DOT (·) and U+2215 DIVISION SLASH (∕) in the input, which get converted to “normal” dots and slashes, as Russ Cox explains on the mailing list.
Note the reference to the SB and FP registers, which don’t exist on x86. SB is the “static base” pseudo-register, which refers to the base of memory — all references are required to be relative to a register, and SB is how you specify absolute addresses. The linker will select an appropriate x86 addressing mode. Similarly, FP is the pseudo-register pointing to the base of our stack frame; Typically this will turn into an access relative to %rsp, with the offset adjusted appropriately.
Go and the plan9 C compilers have their own ABI
Go, and C code compiled by 6c don’t use the “normal” SysV ABI and calling convention, but have their own. It’s striking in its simplicity compared to what you may be used to:
All registers are caller-saved
All parameters are passed on the stack
Return values are also returned on the stack, in space reserved below (stack-wise; higher addresses on amd64) the arguments.
In general, this seems to be consistent with a pattern among the plan9 tools of preferring simplicity and consistency over marginal performance gains. It’s really hard to say that they’re wrong for making that choice.
You can see this at work in the ·add example above, where we pull arguments from 0(FP) and 8(FP), and return the sum by writing back into 16(FP). This also explains why we had a 24-byte argument frame, despite only accepting two eight-byte arguments — the return value also gets a slot in the argument frame.
Go funcs are represented as a pointer to a C function pointer
You can read Russ’s writeup for more information, but Go (as of 1.1) represents funcs as a pointer to a C function pointer. On function entry, a well-known “context” register (%rdx on amd64, or DX as Go calls it) points at the function pointer. This is used to implement closures — the C function pointer is followed by context information, and the pointed-to assembly code knows how to extract that information from DX in the correct way and then proceed onwards.
The same representation lets you handle the case where you have an r io.Reader and capture f := r.Read — in that case, f will be a pointer to a block of memory like so:
f --> | io.Reader.Read·fm | { r ... }
where io.Reader.Read·fm (note the middle dot — because of the above-mentioned translation, it is impossible to refer to this symbol from human-written code, even in C or assembly) is a compiler-generated stub that knows how to extract r out of DX and invoke Read on it.
You can see how this works in great detail by reading the compiler output.
Static linking has many implications
Because Go is always statically linked, the linker can see the entire program at link-time, and they use this fact. The interprocedural analysis on stack depth I mentioned above is one case. Go also assumes that the entire Go symbol table is available at link-time, and loads a version into the binary, which is used by, among other places, the garbage collector when tracing memory!
When working on my JIT, I had code that generated calls from JITed code back into Go. I found Go occasionally printing errors like
runtime: unexpected return pc for io.Writer.Writer·fm called from 0x......
I dug into the runtime to see if I could somehow inform the runtime of the address of my JITed code, but nope: That error comes from code that does a single binary-search on The Symbol Table and bails if the caller PC isn’t there.
Once go gets dynamic linking (I understand it’s on the way!), that will probably also expand the range of things I can safely do from a JIT :)
cgo is slow
I’d heard this, but I got a chance to observe it first-hand. In an effort to make my JIT safer, I coopted the cgo runtime to run JITed code on a C stack via cgo. The overhead difference is huge (for the trivial go -> jit call):
BenchmarkEmptyCall	500000000	3.43 ns/op
BenchmarkEmptyCgoCall	 50000000	60.6 ns/op
The overhead of calling back into Go from cgo is even greater (this is benchmarking go -> jit -> go call chains):
BenchmarkGoCall	500000000	5.07 ns/op
BenchmarkCgoCall	 10000000	 250 ns/op
It’s bad enough that my Brainfuck JIT is actually slower than a straight-up interpreter on many simple programs (. and , are implemented via calls back into Go, so the jit -> go overhead comes into play), if you use the cgo version:
BenchmarkCompiledHello	500000	3405 ns/op
BenchmarkCompiledHelloCgo	500000	5861 ns/op
BenchmarkInterpretHello	500000	5679 ns/op
Go’s testing and benchmarking tools are really fun
Just see the above section for the benchmarking tools!
For the JIT, I ported a simple C++ x86 assembler I’d written for another project. While I never actually wrote a real test suite for the C++ one — it seemed far too annoying — the Go one actually has decent test coverage, and I found many bugs in the C++ version, because Go made it so easy to test.
In conclusion
This was a lot of fun! I learned a lot about Go’s runtime and toolchain, and also a bit about x86!
Slow mongo foreground index builds
Recently, I'd noticed a bunch of cases where MongoDB would be far, far slower to build indexes on secondaries than on the primary. An index build would finish in a few hours on a primary, but then take a day or more to build once the indexing operation replicated to a secondary.
Eventually I got annoyed enough to decide to debug. I threw perf and PMP at a build that was running on a secondary, and they mostly just informed me that the build was spending most of its time comparing BSON objects. That's a pretty unsurprising place for an index build to be spending time.
There are a number of ways in which a secondary index build is different from an index build on the primary, but an obvious difference is that (in MongoDB 2.4 and earlier), all index builds on secondaries are "foreground" builds, whereas any sane site does background builds on their primary. Foreground builds block the entire database for their duration, but are allegedly faster and result in more compact indexes. I decided to do a simple benchmark of foreground vs. background, to validate the hypothesis that the allegedly-faster foreground builds are slower. I wrote a simple benchmark, and got the following results building a trivial index on ~6M items:
Done. items=6250000 fg=135.7 bg=95.4
Done. items=6250000 fg=125.1 bg=96.9
Done. items=6250000 fg=125.6 bg=95.8
Done. items=6250000 fg=122.5 bg=92.3
i.e. the fg build is indeed substantially slower.
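The benchmark script itself isn't shown here; its shape, sketched with pymongo (collection and field names are invented, and it assumes the ~6M documents are already loaded and that create_index doesn't return until the build completes), is roughly:

import time
import pymongo

coll = pymongo.MongoClient().bench.items   # pre-populated with ~6M documents

def build(background):
    coll.drop_indexes()
    start = time.time()
    coll.create_index("value", background=background)  # time the build itself
    return time.time() - start

print("fg=%.1f bg=%.1f" % (build(False), build(True)))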
At this point, to the source-code! Background and foreground builds use basically completely different algorithms. A background build creates the index metadata, and then iterates over the collection and adds each entry to the index:
while ( cc->ok() ) {
    BSONObj js = cc->current();
    try {
        ...
        addKeysToIndex(ns, d, idxNo, js, cc->currLoc(), dupsAllowed);
        ...
        cc->advance();
    }
    …
}
Since the index is just a b-tree, the performance of a background index build should look like O(N log N) with a relatively large base on the logarithm, and presumably the constants should be closely related to the normal insert performance for MongoDB.
Foreground builds, however, are entirely different. "phase 1" of the index build walks the collection and collects all the index keys, and then "phase 2" sorts the keys and does a bottom-up B-Tree build from the sorted records. The idea, presumably, is that the sort+build can be faster and build a more optimal B-Tree than the incremental background build. A comparison sort still can't beat the O(N log N) we saw before, but constant factors mean a lot, especially when there's disk involved.
The sort is done using a BSONObjExternalSorter, so let's dig into that.
Objects are added to the sorter using .add(). This method adds the object to _cur, which is a "FastArray", aka a preallocated in-memory array of objects:
Data& d = _cur->getNext();
d.first = o.getOwned();
d.second = loc;
If _cur fills up or we've used up enough total space, we call finishMap to flush cur to disk, and allocate a new buffer:
if ( _cur->hasSpace() == false || _curSizeSoFar > _maxFilesize ) {
    finishMap( mayInterrupt );
    LOG(1) << "finishing map" << endl;
}
finishMap sorts the _cur buffer in-memory, and then flushes it to a new file. So, at the end of phase 1, the BSONObjExternalSorter has a whole bunch of sorted files on disk. The caller invokes .sort(), and then can begin a sorted iteration. If all the data so far fits in-memory, BSONObjExternalSorter just does an in-memory sort and iterates _cur, but since we're looking at the large-collection case, we can skip over that, and see that sort() essentially just calls finishMap one final time, so that all of the data is now in k individually-sorted on-disk files.
Once sort() has been called, the index build code requests an Iterator using extsor->iterator(), and walks over the sorted keys in-order by repeated calls to i->more() and i->next():
while( i->more() ) {
    ...
    BSONObjExternalSorter::Data d = i->next();
    try {
        btBuilder.addKey(d.first, d.second);
    }
    ...
}
So let's dig into next(). The core of next() is this loop:
for ( unsigned i=0; i<_stash.size(); i++ ) {
    if ( ! _stash[i].second ) {
        if ( _files[i]->more() )
            _stash[i] = pair<Data,bool>( _files[i]->next() , true );
        else
            continue;
    }

    if ( slot == -1 || _cmp( best , _stash[i].first ) == 0 ) {
        best = _stash[i].first;
        slot = i;
    }
}
The gist of this is clear, and we can read more code to verify: _stash is a cache of the next object from each of the k files, and _files is an array of iterators over each of the k files on-disk. next() performs a straightforward loop over each of the k files, selecting the min element, and returns it. Since each of the files is individually-sorted, this will produce a sorted iteration ... but we do O(k) BSON comparisons for each returned element, and, if the total collection is large relative to the size of an individual array, k = O(N) (where N is the total number of elements), for a total performance of O(N²)!
N² is gonna eventually suck, no matter what, but the constants are hugely important here. O(N*N/sizeof(each file)) is always asymptotically O(N²), but if, say, each file is about the size of RAM, it's possible you won't notice until truly huge collections. So let's find out. There are two conditions on which we'll spill to a new file:
if ( _cur->hasSpace() == false || _curSizeSoFar > _maxFilesize ) {
_maxFilesize defaults to 100MB (and is not overridden by the index code). hasSpace() is defined by the fixed-size FastArray class used for _cur; FastArray defaults to a capacity of 10,000 items, and that default is also not changed.
So we spill into a new file for every 10,000 index items, or 100 megabytes of data, whichever comes first. These files contain only the index keys, and individual keys are small (MongoDB only supports index entries of <= 1k, for one!), so 10,000 keys come to well under 100MB, meaning the 10,000-item limit is the one that fires in practice. 10,000 is, all things considered, a pretty small number for a database, so we should expect the slowness to show up pretty quickly -- and indeed, my benchmarks show that the tipping point where foreground indexes become slower is somewhere in the vicinity of 1M items.
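To put rough numbers on it (my arithmetic, not a measurement): the ~6.25M-item benchmark above spills into about 625 files, so every next() call does on the order of 625 BSON comparisons, or roughly 4 billion comparisons over the whole merge. A heap-based 625-way merge would instead need about log2(625) ≈ 10 comparisons per item, i.e. around 60 million total.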
It turns out, by the way, that this issue is fixed in 2.6, so hopefully we'll soon never have to deal with painfully-slow index builds again. The relevant code has been dramatically refactored, and the core merge algorithm now uses an STL heap, restoring O(N log N) performance.
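For the flavor of that fix, here's a minimal Python sketch of a heap-based k-way merge over individually-sorted files (my illustration, not MongoDB's actual code):

import heapq

def merge_sorted_files(files):
    # files: a list of k iterables, each already sorted.
    # Each next() costs O(log k) comparisons (one heap pop + one push)
    # instead of the O(k) linear scan over _stash in the 2.4 code, so the
    # whole merge is O(N log k) rather than O(N * k).
    heap = []
    for idx, it in enumerate(map(iter, files)):
        first = next(it, None)  # assumes None never appears as a key
        if first is not None:
            heap.append((first, idx, it))
    heapq.heapify(heap)
    while heap:
        value, idx, it = heapq.heappop(heap)
        yield value
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, idx, it))

(In real Python you'd just call heapq.merge(*files), which does the same thing.)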
Text
Surveying various languages' string-search algorithms
When Stripe ran our CTF 3.0, I wrote most of level 3, which was a full-text search challenge inspired in part by my own livegrep.
I wrote a naïve implementation, which just looped over the files, read them into memory, and used java.lang.String.contains to check if each file contained the "needle", and we released that implementation as the baseline implementation that contestants needed to improve on.
I also wrote a solution that used a simple trigram index, which was the solution you had to meet or beat to win.
But in the interests of experimentation, I also wrote another solution which, during the "indexing" phase, slurped all the files' text into a big in-memory structure, and used java.lang.String.indexOf to find the matching index, and worked from there. I strongly suspected that, given how comparatively small the corpus was (about 100MB), this solution should be fast enough to win, and I wanted to confirm or refute that hunch.
To my surprise and disappointment, however, it turned out that this approach was barely faster than the one that read each file off the disk each time! Obviously, at least after the first try, the files would be in disk cache, so there wouldn't be much actual disk I/O in either case, but I had expected that the I/O and filesystem overhead would at least be significant!
After a bit of profiling and experimentation, I began to unhappily form a hypothesis: Maybe java.lang.String's string-search algorithm was just really inefficient! So, I went to the source, and, after an hour of desperately digging through openjdk.java.net, I found the source code for java.lang.String and discovered the horrible truth:
/**
 * Code shared by String and StringBuffer to do searches. The
 * source is the character array being searched, and the target
 * is the string being searched for.
 *
 * @param   source       the characters being searched.
 * @param   sourceOffset offset of the source string.
 * @param   sourceCount  count of the source string.
 * @param   target       the characters being searched for.
 * @param   targetOffset offset of the target string.
 * @param   targetCount  count of the target string.
 * @param   fromIndex    the index to begin searching from.
 */
static int indexOf(char[] source, int sourceOffset, int sourceCount,
        char[] target, int targetOffset, int targetCount,
        int fromIndex) {
    if (fromIndex >= sourceCount) {
        return (targetCount == 0 ? sourceCount : -1);
    }
    if (fromIndex < 0) {
        fromIndex = 0;
    }
    if (targetCount == 0) {
        return fromIndex;
    }

    char first = target[targetOffset];
    int max = sourceOffset + (sourceCount - targetCount);

    for (int i = sourceOffset + fromIndex; i <= max; i++) {
        /* Look for first character. */
        if (source[i] != first) {
            while (++i <= max && source[i] != first);
        }

        /* Found first character, now look at the rest of v2 */
        if (i <= max) {
            int j = i + 1;
            int end = j + targetCount - 1;
            for (int k = targetOffset + 1; j < end && source[j] == target[k]; j++, k++);

            if (j == end) {
                /* Found whole string. */
                return i - sourceOffset;
            }
        }
    }
    return -1;
}
In other words, all the substring-search methods on java.lang.String use essentially the most naïve string-matching algorithm imaginable, which degrades to O(n*m) runtime in the worst case!
It turns out that virtually every other language I know of uses an optimized string-search by default, which had the upshot that simply rewriting our Scala code in Ruby(!) would actually make the code dramatically faster and pass our benchmarks! "Oops".
But inspired by this, I decided to do a brief survey of common language/VM string-search algorithms, and here are the results:
Lang/runtime    Algorithm                                                                                Reference
Java            Naïve                                                                                    [source]
glibc strstr    two-way, with a Boyer-Moore shift table for long needles                                 [source]
golang          Rabin-Karp for strings.Index, naïve for bytes.Index, Boyer-Moore for strings.Replacer    strings.Index, bytes.Index
Python          Tweaked Boyer-Moore-Horspool                                                             [writeup] [source]
Ruby 1.8        Rabin-Karp                                                                               [source]
Ruby 1.9+       Something homegrown with a shift table.[1]                                               [source]
v8              Boyer-Moore-Horspool, with some special-cases for short needles.                        [source]
So it definitely looks like Boyer-Moore or Boyer-Moore-Horspool are the winners among the runtimes that have put serious thought into their implementation, although unsurprisingly you need to tweak the pure algorithm to get good performance in practice. Rabin-Karp is super-simple to implement and is probably a performance win over the naïve approach (but can't really compete with anything that uses a good shift table), so it doesn't surprise me that it shows up a few times, but not in any of the heavier-weight implementations. And Java really is unique in having put no algorithmic effort into its basic string-search -- surprising to me, given the incredible sophistication of the JVM and Java standard libraries in general.
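To make the "shift table" idea concrete, here's a bare-bones Boyer-Moore-Horspool in Python (my own sketch, not any of the implementations linked above):

def horspool_index(haystack, needle):
    # Return the index of the first occurrence of needle in haystack, or -1.
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    # For every character in the needle (except its final position), record
    # how far the window may shift when that character is the last one under
    # the window; any character not in the table shifts by the full length m.
    shift = {c: m - 1 - i for i, c in enumerate(needle[:-1])}
    pos = 0
    while pos + m <= n:
        if haystack[pos:pos + m] == needle:
            return pos
        # Skip based on the haystack character aligned with the needle's end.
        pos += shift.get(haystack[pos + m - 1], m)
    return -1

The whole win is that one precomputed table: on typical text most mismatches skip the window forward by nearly the full needle length, instead of advancing one character at a time like Java's loop above.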
One question I haven't answered and probably won't bother to (but would love to see a good writeup of!) is how well all the optimized algorithms stack up against each other, across a wide range of needle and haystack sizes, with various types of partial matches, and so on.
[1] I don't recognize Ruby 1.9's string-search algorithm as having a standard name. I initially thought it was a tweaked Knuth-Morris-Pratt, but critically, unlike KMP, the outer loop doesn't take into account where or how the memcmp fails (thanks to Greg Price for pointing this out to me). The difference is key: While KMP is guaranteed O(n+m) time, you can pretty easily drive this algorithm to be quadratic:
$ pry
[1] pry(main)> 12.upto(22).map do |i|
[1] pry(main)*   needle = 'a'*(2**(i-1)) + 'b'
[1] pry(main)*   haystack = 'a' * (2**i)
[1] pry(main)*   a = Time.now; haystack.include?(needle); b = Time.now
[1] pry(main)*   t = b - a
[1] pry(main)*   [t, t/haystack.length**2]
[1] pry(main)* end
=> [[0.000150217, 8.953630924224853e-12],
    [0.000434998, 6.481975317001343e-12],
    [0.001405982, 5.237691104412079e-12],
    [0.00521789, 4.859538748860359e-12],
    [0.02278149, 5.304228980094195e-12],
    [0.091578843, 5.330590240191668e-12],
    [0.397198244, 5.779995175544172e-12],
    [1.651885885, 6.0095258413639385e-12],
    [8.639180912, 7.857289267121815e-12],
    [44.239053054, 1.005879609101612e-11],
    [265.964429048, 1.5118327442451118e-11]]
(p.s. if you're trying to demonstrate quadratic behavior, you know you're onto something as soon as you have to go get coffee waiting on your microbenchmark)
Text
Monitoring EventMachine using /proc
So we run a bunch of EventMachine at Stripe. I personally hate EventMachine, but it's what we've got, and it's probably the best answer if you really want async I/O in ruby.
One question you inevitably find yourself asking: How close is my EventMachine worker process to capacity? How many more requests/second can this worker handle?
This is, frustratingly, not a super straightforward question. Because of the asynchronous nature of EM, you might have multiple requests logically in-flight at a given moment, but all asynchronously waiting on I/O, and so the process is quite able to handle more. You can look at CPU usage, but because Ruby is much less good than, say, node.js at being religious about async I/O, our workers do also do a fair amount of synchronous I/O -- so a worker might be burning very little CPU, but spending all its time inside synchronous I/O, and thus not have any spare capacity.
Thinking about it more, we decided that the metric we wanted to measure was, "what fraction of the time is this worker spending waiting inside the EventMachine main dispatch loop?" -- that is, essentially, exactly "what fraction of the time is this worker available to handle additional requests?"
Unfortunately EM doesn't expose any easy hooks (that we found, at least) to measure this. We considered doing some truly awful monkeypatching, but rejected that idea as too risky and grody (although we later learned that Conrad Irwin, developer of a bunch of nice ruby software, had written this monkey-patch already and packaged it -- depending on your tastes for other peoples' monkey-patches, this may be a better idea).
But thinking more, we realized that we don't need any hooks into EM at all. The question we're trying to answer can be rephrased purely in terms of the Linux system-call layer: "What fraction of the time is my process spending in the main EventMachine select() system call?"
If we can recognize that externally, we can answer this entirely from outside the process! So, what does an EM process sleeping in its event loop look like? Let's answer this two ways:
First, let's go to the source. We trace the Run() method in em.c, and (knowing that we happen to be using the epoll backend) find a _RunEpollOnce method, the meat of which is simple:
FD_ZERO(&fdreads);
FD_SET(epfd, &fdreads);

if ((ret = rb_thread_select(epfd + 1, &fdreads, NULL, NULL, &tv)) < 1) {
    if (ret == -1) {
        assert(errno != EINVAL);
        assert(errno != EBADF);
    }
    return true;
}
So assuming that rb_thread_select is a simple wrapper around select, we expect to see a process in a
select(N+1, [N], NULL, NULL, &timeout)
call, where N refers to an epoll fd.
We can verify this via strace, of course. We run
[nelhage@anarchique:~]$ ruby -reventmachine -e 'EM.epoll; EM.run {}'
and then:
[nelhage@anarchique]$ strace -p 30413
Process 30413 attached
select(8, [7], NULL, NULL, {0, 83342}) = 0 (Timeout)
select(8, [7], NULL, NULL, {0, 90000}) = 0 (Timeout)
select(8, [7], NULL, NULL, {0, 90000}) = 0 (Timeout)
select(8, [7], NULL, NULL, {0, 90000}) = 0 (Timeout)
select(8, [7], NULL, NULL, {0, 90000}) = 0 (Timeout)
select(8, [7], NULL, NULL, {0, 90000}^C
Process 30413 detached
 <detached ...>
[nelhage@anarchique]$ ls -l /proc/30413/fd/7
lrwx------ 1 nelhage nelhage 64 Mar 27 23:06 /proc/30413/fd/7 -> anon_inode:[eventpoll]
Ok, great. But how do we do this programmatically? We could use strace, but that seems heavyweight and overkill. We could write something ourselves using ptrace (I've done it before!), but custom C code twiddling around my production services with arcane kernel interfaces is not my idea of fun. (Ok, well, it is kinda my idea of fun, but that doesn't make it a good plan...).
But we really don't need very much information ... just the current system call ... and it turns out that /proc has this for us!
[nelhage@anarchique]$ cat /proc/30413/syscall
23 0x8 0x19a0240 0x0 0x0 0x7fffac3e0910 0x3f 0x7fffac3e0880 0x7f938ef7ea53
So the first number is the syscall -- on amd64, 23 is select.
The next values are the arguments. 0x8 is the "max FD + 1" number. Next come the (read, write, except) fd_sets: we can't look inside them trivially, but we can see that only the read set is non-NULL, and that the timeout (0x7fffac3e0910) is also non-NULL. The sixth argument slot is garbage, since select only takes 5 arguments, and the final two values are the process's stack pointer and program counter.
We know from this that the nfds argument is 8, which means that the epoll fd should be 7, and we can confirm that using /proc/$pid/fd as shown above. So putting it all together we get this ruby function. We can run that inside a monitor process that discovers all our workers (at Stripe, we do this via the einhorn command socket) and periodically polls each of them, sampling to determine if they're in the EM main loop. We dump this into graphite, and, sampled across time, it gives us pretty good visibility into when it's time to spin up more worker processes!
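The function we actually deploy is the Ruby one referenced above, but the core check is tiny; here's a sketch of the idea in Python (assuming Linux on amd64, where select(2) is syscall number 23):

SYS_SELECT = 23  # amd64 syscall number for select(2)

def in_em_select(pid):
    # Returns True if the process is currently blocked in select().
    # /proc/$pid/syscall starts with the syscall number (or "running"
    # if the process isn't in a syscall at all).
    try:
        with open("/proc/%d/syscall" % pid) as f:
            fields = f.read().split()
    except IOError:
        return False
    return bool(fields) and fields[0] == str(SYS_SELECT)

# Sampled periodically across all workers, the fraction of True results
# approximates the fraction of time each worker spends idle in the EM loop.

A more paranoid version would also confirm, via /proc/$pid/fd, that the fd being selected on really is the eventpoll fd, as we did by hand above.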
Note I've only verified this with EM around 1.0.0 on ruby 1.8 and 1.9, and it depends by construction on using epoll on Linux amd64. It's probably tweakable to other environments.
Text
Poking inside Python closures
So this morning, a friend was bitching about some Python code he'd inherited and was trying to debug. The author of the code, in a fit of insanity (er, encapsulation), had written code using a bunch of nested closures, like so:
def f():
    def g():
        return "hello this is g"
    # do something with g()
He wanted to poke at this code in a REPL, and in particular was hoping to call g(), but couldn't because it wasn't accessible outside of the function. I made an offhand remark about poking inside the function object, since I remembered Python exposes a whole bunch of internals under various "dunder" variables and methods, and then decided to dig in. Starting from the above example, I started poking around on f in a python shell. I found a bunch of promising things under f.__code__, but no references to g. But then I realized that g doesn't even exist until f has been invoked -- because of how Python works, the body of f is parsed up-front, but nothing inside it, not even the nested def, is executed until f is called. So I called f() again and kept poking. This time, by looking at random things under f.__code__, partial success!
>>> f.__code__.co_consts
(None, <code object g at 0x7ffe113d47b0, file "<stdin>", line 2>)
So we have the "code object" for g, which in Python is essentially a reference to the compiled bytecode of the function, but not the actual function object itself. This again makes sense -- the text of the body of g, and thus the resulting bytecode, is a constant forever and so can be cached between runs of f, but every time f runs, we instantiate a new function object, since semantically the def is run anew. And on further investigation, the co_consts value was always there -- I just hadn't noticed it. But at this point my friend points out that my f example above is too simplistic: his code actually returns a function that has all the sub-functions in its scope, and maybe we can get at them from there:
def f():
    def g():
        return "hello this is g"
    def h():
        return "the g says:" + g()
    return h
All at once the problem is completely changed! h has a reference to g -- the actual, concrete g that was generated on the relevant call to f() -- and so there's surely a way to get at it. Grabbing h = f(), I jump immediately to poking at the returned function's "closure", since that's where I'd expect this stuff to live, and:
>>> h.__closure__ (<cell at 0x7f6a076a7bb0: function object at 0x7f6a076b12a8>,)
That looks promising! But how do I poke inside a cell?
>>> dir(h.__closure__[0]) ['__class__', '__cmp__', '__delattr__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'cell_contents']
cell_contents sounds good, and behold:
>>> h.__closure__[0].cell_contents
<function g at 0x7f6a076b12a8>
>>> h.__closure__[0].cell_contents()
'hello this is g'
From a repl we can just poke through all the entries in __closure__ to find the one we want, but in general how do we figure out which things are named what inside h? Well, in PL theory, "g" is a "free variable" inside h, so let's look at that co_freevars I happened to notice earlier:
>>> h.__code__.co_freevars ('g',)
I'm 99% certain there's a 1:1 correspondence between co_freevars and __closure__. And note that it makes sense that the variable names are on __code__ but the values are on h: The variable names are a property of the compiled code of the function, and shared between all instances, but the attachment to specific runtime objects is specific to this particular invocation of f and value of h. It all makes sense!
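Assuming that correspondence holds (and it does in CPython, where both are populated in the same order), a small helper makes the name-to-value mapping explicit -- this is just my convenience sketch, not anything built in:

def closure_vars(func):
    # Map each free-variable name in func to the object captured in its
    # closure, relying on co_freevars and __closure__ sharing an order.
    if func.__closure__ is None:
        return {}
    return dict(zip(func.__code__.co_freevars,
                    (cell.cell_contents for cell in func.__closure__)))

# e.g. closure_vars(f())["g"]() returns 'hello this is g'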
Link
Trials and tribulations with Python, subprocess, and unix signals
Link
Let’s start by backfilling some links…
tracking down a single-bit memory corruption