Powerset seems like an interesting concept, but reminds me a lot of the NLP projects I saw while I was in college, in that they produce output that almost makes sense, except not really. China "took: actions, actions, and steps", shark "eats: Samuel L. Jackson" (!!!), or my favorite, "computer clogs: system and government".
It seems to do better on phrases that represent very specific concepts like 'Emacs' or 'calendar'. In general though, it combines very specific and very broad concepts (presumably because it can't tell the difference), for instance "Electricity reaches: village and Drewsey".
Still, kinda neat, especially if they can pull in information from other sources and cross it with the Wikipedia data. e2, perhaps?
(Edit: Oh, and apparently they got bought by Microsoft today)
[/net |
(permalink)]
My #1 Most Wanted Google Reader Feature (Right Now, At Least)
Mark all as unread
I just discovered Arch Robison (formerly of KAI C++, now at Intel) has a blog, and I'd like to work through all of his posts since it looks like he has some good stuff to say. But short of going to each post and marking it as unread, I just have to remember which ones I have or haven't gotten to yet. :(
Update: Google Reader won't let me mark posts in Arch's blog feed as unread. Weird. I neatly solved the problem by reading all of his past posts in one go.
[/net |
(permalink)]
A Failure Case in a Linux Random Number Generator
The Linux kernel implements a random number generator called a Tausworthe generator, in the file lib/kernel32.c. The kernel uses this generator for a variety of non-cryptographic purposes, such as calculating network delays and random ports numbers, choosing a random element to drop from full caches, and many other places where a randomized algorithm is useful.
The state of the generator consists of three 32-bit integers, providing 96 bits of state:
struct rnd_state {
u32 s1, s2, s3;
};
These three values must be initialized in a particular way to ensure the generator works correctly. A quote from the original paper describing the generator1 is included in the source to document these conditions:
... the k_j most significant bits of z_j must be non-zero, for each j. (Note: this restriction also applies to the computer code given in [4], but was mistakenly not mentioned in that paper.)
With an additional note that "This affects the seeding procedure by imposing the requirement s1 > 1, s2 > 7, s3 > 15." (In the source, s1,s2,s3 of rnd_state correspond to z_1,z_2,z_3 in the paper).
To generate output, the following transformation is used (the code used in the kernel uses decimal values for the constants, which hides the underlying binary structure):
s1 = ((s1 & 0xFFFFFFFE) << 12) ^ (((s1 << 13) ^ s1) >> 19); s2 = ((s2 & 0xFFFFFFF8) << 4) ^ (((s2 << 2) ^ s2) >> 25); s3 = ((s3 & 0xFFFFFFF0) << 17) ^ (((s3 << 3) ^ s3) >> 11); return (s1 ^ s2 ^ s3);
Particularly notable here is that, if all of s1, s2, and s3 are zero, then the generator will output nothing but zero values forever.
To set up the initial state, the kernel first grabs an unsigned long's worth of randomness from the kernel entropy pool using get_random_bytes (this is the same high-entropy RNG used to feed /dev/random), which it calls s. The code first checks to make sure s is not zero, and if so, sets it to 1. Then, it derives s1, s2, and s3 from s using this procedure:
#define LCG(n) (69069 * n)
state->s1 = LCG(s);
state->s2 = LCG(state->s1);
state->s3 = LCG(state->s2);
Recall that s is an unsigned long, so on 64-bit systems, the product assigned to s1 will also be 64 bits, but truncated down to a 32-bit value. Herein lies the bug: if the low 32 bits of s are zero, but at least one high bit is set, then the comparison to zero will not be true, but the resulting product will have 32 low order bits. After being truncated, s1 will be zero, which will then also cause s2 and s3 to be zero. As mentioned, this will cause the RNG to output nothing but zero values forever. To a first approximation, get_random_bytes will return a uniform random value, so the odds of this happening are roughly 1/2^32, or about 1 in 4 billion. The RNG is seeded at boot time, with one RNG state per CPU (presumably this is to remove locking contention). Taking a pair of WAGs that there are 20 million CPUs running Linux, and that the average uptime of a Linux machine is 10 days, that suggests an all zero RNG output will occur on at least one CPU about every 6 years.
Additionally, the invariant mentioned in the source, that s1 > 1, s2 > 7, s3 > 15, are not actually met with the current code. For instance on 32-bit systems (with sizeof(long) = 4), a seed of 0x4BC54E0A will generate the state s1 = 0x4BC54E0A, s2 = (s1 * 69069) % 2^32 = 2, s3 = 2*69069 = 138138. In a random sampling, I found there were large numbers of these bad seeds for both 32-bit and 64-bit systems. The paper that describes the RNG does not actually specify how badly violating this condition breaks the generator, though, so it's unclear how serious this actually is.
I submitted a patch to fix these bugs to LKML on June 19. While I got no replies, today it was included as part of Andrew Morton's -mm branch, so this bug should be fixed in upcoming kernels.
[1]: P. L'Ecuyer, Maximally Equidistributed Combined Tausworthe Generators, Mathematics of Computation, 65, 213 (1996), 203--213
Roman Proverbs Applicable to Software
Quod non est in actis, non est in mundo. ("What is not in the documents does not exist")
Iowa Loses Years Of Topsoil in Days
- Des Moines Register, 2008-06-22 (full story)Soil erosion has caused substantial damage to Iowa's farm fields hit hard by heavy rain and flooding this spring, Iowa Secretary of Agriculture Bill Northey said after touring two areas of the state.
Heavy rains this spring came when the soil was most vulnerable to erosion - when the soil has been tilled for planting, but before plants' root systems grow to hold it in place, he said. [...]
Richard Cruse, director of the Iowa Water Center at Iowa State University, said areas of the state lost 7 to 8 tons of topsoil an acre in a 24-hour period. Those areas also had multiple one-day periods when that amount of topsoil per acre was lost.
The Natural Resources Conservation Service defines "tolerable" soil erosion levels at 3 to 5 tons an acre annually, Cruse said, adding that he thinks that the "tolerable" level is too high.
[/economy |
(permalink)]
A major part of C++0x is the addition of an explicit memory model, which will allow for safe multithreaded programming (C's memory model, which C++98 inherits, is really only sensible in single-threaded code). At the most recent C++0x meetings in Sophia Antipolis, an important change was voted into the working paper that assists garbage collection, based on the wording in N2586.
In order to allow safe garbage collection, you have to prohibit "pointer hiding". This is where one plays tricks such as taking the integer value of a pointer, xor'ing it with a constant, and then later reversing the operation. In C99/C++98, this is a valid technique and works (though IMO, anyone using it should be beaten with a stick). But doing this makes garbage collection very difficult: if the pointer is hidden in this way when the GC pass runs, it cannot 'see' that the pointer is still there, and might prematurely deallocate the pointed-to object. So in C++0x, while code hiding pointers will remain valid, actually dereferencing the pointer becomes an undefined operation. The expectation is that, in non-GC'ed C++ programs, it will continue to work as it has in the past, while explicitly allowing a GC to reclaim memory without having to worry that a valid pointer to the object might suddenly pop back into existence.
It seems like the committee is mostly considering this as useful in the context of GC'ed programs, or perhaps in conjunction with leak detectors, but there is actually another reason that this is very very useful: it allows a C++ VM to prohibit reference forging. That restriction could potentially be of great value in preventing the sort of pointer forging attacks which C and C++ code is so vulnerable to.
Seen: multithreaded code that carefully allocates, but never uses, mutexes. It seems simply declaring them is sufficient for the compiler to figure out that the code needs to be serialized.
Steve Dekorte asks DRM opponents how they would feel about the bits representing their fingerprints, genome, car keys, and mental state (I would add also personal finances and cryptographic keys) being copied around willy-nilly, and suggesting that perhaps these people really do want DRM, just not for things they want to copy.
Getting DRM to actually work seems to imply replacing our current computers and networks with essentially closed systems. This seems like trading something for nothing, because frankly, the only entities that I would really worry about trafficking in my fingerprints and genome (and my mental state, if that turns out to be readable outside of an MRI tube) is the federal government, who would just write the rules to suit their needs. If there is anything in our modern era that trumps Disney, it's "national security".
[/security |
(permalink)]
- Bloomberg, 2008-02-20Grain farmers will need to harvest record crops every year to meet increasing global food demand and avoid famine, Potash Corp. of Saskatchewan Inc. Chief Executive Officer William Doyle said.
People and livestock are consuming more grain than ever, draining world inventories and increasing the likelihood of shortages, Doyle said yesterday in an interview on Bloomberg Television. Global grain stockpiles fell to about 53 days of supply last year, the lowest level since record-keeping began in 1960, according to the U.S. Department of Agriculture.
"If you had any major upset where you didn't have a crop in a major growing agricultural region this year, I believe you'd see famine," Doyle, 57, said in New York.
- Financial Times, 2008-06-17Consumers were warned to expect even sharper increases in global food prices after US officials said that some of the country's best farmland was facing its worst flooding for 15 years.
Agriculture officials and traders said the damage could push up worldwide corn and soyabean prices, which have spiralled in recent days as floods have swamped crops in parts of Iowa, the US's biggest corn-producing state.
[/economy |
(permalink)]
Thread Safety Annotations for C/C++ using GCC
Le-Chun Wu of Google has posted a patch to the GCC list that adds a set of thread safety annotations. After annotating your code, GCC can detect (and warn about) certain classes of thread safety errors, like shared variables being used without acquiring a lock. It doesn't handle some useful cases (including 'smart' pointers like shared_ptr), but it does a lot, and having it part of GCC means it can be easily run during regular builds.
There is an API/design doc describing the current interface here. Looks pretty decent and I hope it finds its way into a proper GCC release soon. Since GCC 4.4 is still in stage1, there may even be a chance of it being available in a release this year.
(Next 10)