Why does the described algorithm only decode transmitted blocks when the receiver has decoded all but one of the constituent blocks? For example, if the receiver has received the following 3 blocks:
1. A ^ B ^ C
2. A ^ B
3. B ^ C
it can combine #1 and #2 to decode C, then combine C and #3 to decode B, then combine B and #2 to decode A.
The algorithm as described would in this situation cause the receiver to wait until it's received a single block, either A or B or C, before decoding anything. This strikes me as inefficient.
Edited to add: I think this is analogous to forward vs. backwards chaining: you can either start by combining single blocks with composite blocks to simplify them, or start by combining blocks whose component block lists differ by only a single block. Or you could apply both, which should get the greatest simplification with the smallest amount of input.
Normally one would use some distribution of the number of encoded source blocks in a block. So one would send 30% single blocks, 30% double blocks, 15% triple, and so on until you get some relatively small number of high order encoded blocks. This means that while you can find a solution the way you posted it is a rare case that you have no single block for decoding and since in any case you need to receive a bit over 100% of the blocks before you can decode anything there is no real reason to go to exquisite lengths to find the perfect solution when the trivial case will work 100% of the times (well 99.9999% if you cared to be accurate :-)
The algorithm is also not so much about computational efficiency, to check at any point if you can or cannot decode requires quite a bit of work anyhow. The main advantage of the algorithm is in its relative low complexity and high efficiency on the transmission line.
My point isn't about computational efficiency, but bandwidth efficiency (or alternatively, decoding the full message as early as possible). As I understand it, the Moore's law doubling time for bandwidth is longer than for CPU speed, suggesting that processing is getting cheaper relative to bandwidth, so using more CPU in order to have to send fewer blocks is probably a long-term win.
The decoding amounts to solving a system of linear equation where each equation corresponds to one received packet. Assuming the file to be sent was broken into n parts, minimum n independent equations are required for reconstruction. To do that optimally, receiver needs to keep trying to solve the linear equation system once it has received at least n packets. This means O(n^3) in general case for each received packet. If the equations are generated randomly, the expected overhead seems to be a mere 1.61 extra packets! http://virtual.vtt.fi/virtual/abi/abi-seminar07/abiseminar07...
In practice however, depending on the receiver, you may not want it to solve a general system of linear equations for every packet it receives when n is big. It might be better to stick to a simple algorithm (which has a better time complexity if certain conditions are satisfied by the linear system) and just wait to accumulate enough equations for it to become applicable.
If this was transmission to a single client you were right but when this is used for broadcast or multicast there is less interest to minimize the transmission for each client but rather to optimize it for the entire population at once, since in these cases feedback is scarce in the first place there is little reason to check at every message, it is sufficient to test it once in a while anyhow.
I can see also use cases for this in a unicast transmissions but even there the idea would be about simplifying the communications layer, for example in the case of usage for a fixed rate transmission and with complete disregard for the loss rate. Not sure what I'd think about bandwidth usage and the stop criteria in this case.
With large enough blocks and efficient coding, the noisy-channel coding theorem guarantees nearly error-free transmission rates up to the maximum possible channel capacity. Modern LDPC, Raptor, and Turbo codes can get very close (to within 0.0045dB) of this Shannon limit, meaning that far from being less efficient, FEC allows for maximal bandwidth efficiency.
The algorithm as described would in this situation cause the receiver to wait until it's received a single block, either A or B or C, before decoding anything. This strikes me as inefficient.
Edited to add: I think this is analogous to forward vs. backwards chaining: you can either start by combining single blocks with composite blocks to simplify them, or start by combining blocks whose component block lists differ by only a single block. Or you could apply both, which should get the greatest simplification with the smallest amount of input.