# The Collatz Conjecture

The Collatz conjecture (or 3N+1 problem) is named after Lothar Collatz, who first proposed it in 1937. Wikipedia. Start with any integer. If it is odd, multiply by three and add 1. If it is even, divide by two. The conjecture is that it always converges on 1, then cycles 4, 2, 1, 4, 2, 1.

This has not been proved, and may be undecidable. If there are counterexamples, they (1) either diverge towards infinity forever, or (2) cycle in a loop. No divergent integers (of which there would be an infinite number) or loop (a finite number of integers) have been found, which of course would disprove the conjecture. The loop would have a vast number of very large numbers, the size of the loop related to the binary expansions of the ratio of ln(3)/ln(2).

## Collatz, Improving the Definition

An infinitely large integer, on average, converges in log(infinite) cycles - also infinite. Is this actually "convergence"? What if a sequence of numbers needs 2*log(infinite) cycles, or 10*log(infinite) cycles, or N*log(infinite) cycles? What a sequence needs another log(infinite) cycles after we lost hope and gave up computing with our hypothetical Collatz computer? Unlike many other problems of large numbers (like Fermat's conjectecture), a number problem based on repeated calculations of infinitely large numbers has an inherent infinity to it, and may be undecidable until someone decides on a more defined statement of the problem. Those better defined statements - such as "for any group of N bit numbers, there will be less than N nonconvergent values, may actually be provable or disprovable. The whole problem might be divided into a few well-defined subproblems, and the ability to define the subproblems may be more valuable than a vague answer to the original vague conjecture.

## Collatz, Logically

The Collatz problem can be mapped onto an infinite array of shifters and adders - all binary logic. Since the proposition we wish to test is also logical, then the infinite array of gates can produce a logical signal - the conjecture is true or false - in K*log(infinite) time, where K is (probably) a finite number of hypothetical gate delays multiplied by the log().

It is profitable to think about this hypothetical machine. Theoretical logic design, and especially fault testing of logic, has received a lot of academic attention over the last few decades. Proving that a logical machine (or a program) is 100% functional and without flaws ("stuck at zero or one" or too slow), without testing every possible input, is an incomplete problem, but the theoretical attention it has received might be applied to the Collatz conjecture, with practical results that might prove valuable to manufacturers of digital logic machines.

The Collatz problem also **maps easily onto a Turing machine**, and that has been used to develop theorems about computability and the halting problem. With proper remapping, Collatz may prove to be equivalent to the halting problem, or extend it to new realms. Either result would be useful to the programming community.

Perhaps pure number theoreticians care about Collatz purely because of the problem itself, but there are practical benefits to the search for the proof. Maximizing the benefits and leveraging off existing mathematical systems developed for other practical problems might be one way to roadmap a search for new knowledge about the Collatz conjecture. Elegance is nice, but sometimes you should move stuff with a dump truck rather than tweezers.

## Parallel Collatz Computation - Gate Array Implementation

A "multiply" is not necessary; this can be constructed with a binary adder and a shifter, and hundreds of evaluation circuits could operate in parallel on a fast gate array. If N is odd, then 3N+1 is always even, so there is always at least one divide by two. So, adding N, plus N/2 (the same bits shifted towards the least significant bit), plus a carry into the bottom of a binary adder, is the same operation.

Shifting can be done with a "barrel shifter", with multiple stages shifting zero or one position, zero or two positions, zero of four positions, etc. as necessary. However, a "shift by one" occurs only 1/2 of the time, a "shift by two" 1/4 of the time, etc, so if we iterate with a single shift we on average add only two shift times, which is probably cheaper than adding a lot of complex logic for a barrel shifter, and we only have to look at one or two least significant bits to decide if to do the operation.

The addition is the most complex operation, since we need to propagate a carry through the whole sum. However, adder logic is sped up by using carry/propagate logic and heirarchical chains of blocks. This reduces carry propagate time from o(N) gate delays to o(log N). **Details:** Half sums are computed for each pair of input bits, generating a partial sum (using an exclusive OR), a generate signal (if both input bits are one, an AND gate) and a propagate signal (if one input bit is a one, an OR gate). Adder bits are grouped, typically in blocks of 4, with the whole block producing a block propagate signal (all bit propagate signals within the block are 1) or a generate signal (one of the bit generate signals within the block is one, along with all the subsequent bit propagate signals within the block). These signals are fed to higher level groups of blocks, calculating carries for 4 bits, then 16 bits, then 64 bits, then 256 bits. After the global carries are propagated, we go through another set of blocks to compute the carries for each bit pair. Then we generate the output sum with the first sum and the carry signal into another half adder (an exclusive OR) Given speed limitations associated with moving signals far across a chip, the higher level blocks may use three or two input carry/propagate blocks.

Since the least significant bit from the adder tells us whether we will do a subsequent add or shift (which include the same N/2 part), the question is whether to add N or not, and whether to wait for carry propagate, or just blow the bits through a shift down and move on to the next cycle.

Upper bit zero detection can be delayed through more pipeline logic. If the upper bits are delayed, and our 1/2/4 cycle detection is delayed by a few clock cycles, this is a win overall if we can reduce the number of logic stages and increase the clock rate. Or better, keep a moderate clock rate, reduce the logic swing, and increase gate delay to lower the power dissipation. A gate array full of such circuits will be much more power efficient than a microprocessor running a stored program, but if there are hundreds of these circuits running in parallel on the gate array it could get hot.

After all that perseveration, the simplest way to do this is to shift N down every cycle making N/2, but add the unshifted number and a carry if the LSB is one. That is, the result of every step is N/2 + (N+1)&LSB . A straightforward gate array adder of bit length B, using carry propagate, gets larger as B ln(B), and the step time gets larger proportional to ln(B). However, if we pipeline the upper bits and delay the zero detection, the step time remains small, with some ln(B) proportional cleanup time at the end of our calculation. The average number of steps is also proportional to ln(B).

This would act as a coprocessor; a CPU would generate the candidate number to compute, and shove new numbers at each gate array "cell", after pulling the results of the of the previous calculation out of that cell. Adding arithmetic comparisons to each Collatz cell for "highest value reached", "last descent below original value", hailstone cycle counts, etc. would each cost as much time and logic as the original calculation, though simple calculations (how many cycles, highest carry generated, etc. would be cheaper). We could use the gate array to survey for "interesting" numbers, then analyze them in detail with the main CPU or with a slower, more complicated, and more expensive Collatz cell.

method |
Clocks per Cycle |
Cycle speed |
Logic speed |
Multithread |
Power |

Standard 64 bit CPU |
B/64 +overhead |
many instructions |
Better |
16? |
Very High |

Gate array with standard adder |
1 |
B |
Slower |
400? |
Moderate |

Gate array with carry propagate |
1 |
ln(B) |
Slower |
200? |
High |

Gate array, cp, pipelined |
1 |
1 |
Slower |
100? |
High |

|| Custom logic, cp, pipelined || 1 || 1 || Optimum || 1000? || High || Huge NRE ||

Using a gate array to develop a "normal" 256 bit calculator for the usual explorations, then a very large carry propagate adder (64K bit words?) for the exploration of very large numbers that are potentially part of large loops, would be a lot faster than the usual multiprecision calculations. The discovery process for large loop candidates might also benefit from a gate array. But at some point, the only way some of these very large calculations will complete in an acceptable time is with a full custom chip design (or two), which might cost $1M to design and build the chips, for a "Moon Shot" calculating system. This will happen when the task involves dozens of researchers and thousands of gate arrays in boards, and will be for cost and runtime reduction.