The Collatz Conjecture
The Collatz conjecture (or 3N+1 problem) is named after Lothar Collatz, who first proposed it in 1937. Wikipedia. Start with any integer. If it is odd, multiply by three and add 1. If it is even, divide by two. The conjecture is that it always converges on 1, then cycles 4, 2, 1, 4, 2, 1.
This has not been proved, and may be undecidable. If there are counterexamples, they (1) either diverge towards infinity forever, or (2) cycle in a loop. No divergent integers (of which there would be an infinite number) or loop (a finite number of integers) have been found, which of course would disprove the conjecture. The loop would have a vast number of very large numbers, the size of the loop related to the binary expansions of the ratio of ln(3)/ln(2).
Collatz, Improving the Definition
An infinitely large integer, on average, converges in log(infinite) cycles  also infinite. Is this actually "convergence"? What if a sequence of numbers needs 2*log(infinite) cycles, or 10*log(infinite) cycles, or N*log(infinite) cycles? What a sequence needs another log(infinite) cycles after we lost hope and gave up computing with our hypothetical Collatz computer? Unlike many other problems of large numbers (like Fermat's conjectecture), a number problem based on repeated calculations of infinitely large numbers has an inherent infinity to it, and may be undecidable until someone decides on a more defined statement of the problem. Those better defined statements  such as "for any group of N bit numbers, there will be less than N nonconvergent values, may actually be provable or disprovable. The whole problem might be divided into a few welldefined subproblems, and the ability to define the subproblems may be more valuable than a vague answer to the original vague conjecture.
Collatz, Logically
The Collatz problem can be mapped onto an infinite array of shifters and adders  all binary logic. Since the proposition we wish to test is also logical, then the infinite array of gates can produce a logical signal  the conjecture is true or false  in K*log(infinite) time, where K is (probably) a finite number of hypothetical gate delays multiplied by the log().
It is profitable to think about this hypothetical machine. Theoretical logic design, and especially fault testing of logic, has received a lot of academic attention over the last few decades. Proving that a logical machine (or a program) is 100% functional and without flaws ("stuck at zero or one" or too slow), without testing every possible input, is an incomplete problem, but the theoretical attention it has received might be applied to the Collatz conjecture, with practical results that might prove valuable to manufacturers of digital logic machines.
The Collatz problem also maps easily onto a Turing machine, and that has been used to develop theorems about computability and the halting problem. With proper remapping, Collatz may prove to be equivalent to the halting problem, or extend it to new realms. Either result would be useful to the programming community.
Perhaps pure number theoreticians care about Collatz purely because of the problem itself, but there are practical benefits to the search for the proof. Maximizing the benefits and leveraging off existing mathematical systems developed for other practical problems might be one way to roadmap a search for new knowledge about the Collatz conjecture. Elegance is nice, but sometimes you should move stuff with a dump truck rather than tweezers.
Parallel Collatz Computation  Gate Array Implementation
A "multiply" is not necessary; this can be constructed with a binary adder and a shifter, and hundreds of evaluation circuits could operate in parallel on a fast gate array. If N is odd, then 3N+1 is always even, so there is always at least one divide by two. So, adding N, plus N/2 (the same bits shifted towards the least significant bit), plus a carry into the bottom of a binary adder, is the same operation.
Shifting can be done with a "barrel shifter", with multiple stages shifting zero or one position, zero or two positions, zero of four positions, etc. as necessary. However, a "shift by one" occurs only 1/2 of the time, a "shift by two" 1/4 of the time, etc, so if we iterate with a single shift we on average add only two shift times, which is probably cheaper than adding a lot of complex logic for a barrel shifter, and we only have to look at one or two least significant bits to decide if to do the operation.
The addition is the most complex operation, since we need to propagate a carry through the whole sum. However, adder logic is sped up by using carry/propagate logic and heirarchical chains of blocks. This reduces carry propagate time from o(N) gate delays to o(log N). Details: Half sums are computed for each pair of input bits, generating a partial sum (using an exclusive OR), a generate signal (if both input bits are one, an AND gate) and a propagate signal (if one input bit is a one, an OR gate). Adder bits are grouped, typically in blocks of 4, with the whole block producing a block propagate signal (all bit propagate signals within the block are 1) or a generate signal (one of the bit generate signals within the block is one, along with all the subsequent bit propagate signals within the block). These signals are fed to higher level groups of blocks, calculating carries for 4 bits, then 16 bits, then 64 bits, then 256 bits. After the global carries are propagated, we go through another set of blocks to compute the carries for each bit pair. Then we generate the output sum with the first sum and the carry signal into another half adder (an exclusive OR) Given speed limitations associated with moving signals far across a chip, the higher level blocks may use three or two input carry/propagate blocks.
Since the least significant bit from the adder tells us whether we will do a subsequent add or shift (which include the same N/2 part), the question is whether to add N or not, and whether to wait for carry propagate, or just blow the bits through a shift down and move on to the next cycle.
Upper bit zero detection can be delayed through more pipeline logic. If the upper bits are delayed, and our 1/2/4 cycle detection is delayed by a few clock cycles, this is a win overall if we can reduce the number of logic stages and increase the clock rate. Or better, keep a moderate clock rate, reduce the logic swing, and increase gate delay to lower the power dissipation. A gate array full of such circuits will be much more power efficient than a microprocessor running a stored program, but if there are hundreds of these circuits running in parallel on the gate array it could get hot.
After all that perseveration, the simplest way to do this is to shift N down every cycle making N/2, but add the unshifted number and a carry if the LSB is one. That is, the result of every step is N/2 + (N+1)&LSB . A straightforward gate array adder of bit length B, using carry propagate, gets larger as B ln(B), and the step time gets larger proportional to ln(B). However, if we pipeline the upper bits and delay the zero detection, the step time remains small, with some ln(B) proportional cleanup time at the end of our calculation. The average number of steps is also proportional to ln(B).
This would act as a coprocessor; a CPU would generate the candidate number to compute, and shove new numbers at each gate array "cell", after pulling the results of the of the previous calculation out of that cell. Adding arithmetic comparisons to each Collatz cell for "highest value reached", "last descent below original value", hailstone cycle counts, etc. would each cost as much time and logic as the original calculation, though simple calculations (how many cycles, highest carry generated, etc. would be cheaper). We could use the gate array to survey for "interesting" numbers, then analyze them in detail with the main CPU or with a slower, more complicated, and more expensive Collatz cell.
method 
Clocks per Cycle 
Cycle speed 
Logic speed 
Multithread 
Power 

Standard 64 bit CPU 
B/64 +overhead 
many instructions 
Better 
16? 
Very High 

Gate array with standard adder 
1 
B 
Slower 
400? 
Moderate 

Gate array with carry propagate 
1 
ln(B) 
Slower 
200? 
High 

Gate array, cp, pipelined 
1 
1 
Slower 
100? 
High 

Custom logic, cp, pipelined 
1 
1 
Optimum 
1000? 
High 
Huge NRE 
Using a gate array to develop a "normal" 256 bit calculator for the usual explorations, then a very large carry propagate adder (64K bit words?) for the exploration of very large numbers that are potentially part of large loops, would be a lot faster than the usual multiprecision calculations. The discovery process for large loop candidates might also benefit from a gate array. But at some point, the only way some of these very large calculations will complete in an acceptable time is with a full custom chip design (or two), which might cost $1M to design and build the chips, for a "Moon Shot" calculating system. This will happen when the task involves dozens of researchers and thousands of gate arrays in boards, and will be for cost and runtime reduction.
Odd Integer Starts Only
All even starting integers divide down monotonically to odd integers, so "even starts" are trivial and need not be computed. Odd starts that are divisible by three have only 2^N multiple antecedents, also uninteresting, they cannot have complex webs of antecedents like other integers, nor can they be part of a loop. Every odd integer that is divisible by three reduces to a different odd integer not divisible by 3, or two the "binary backbone chain".
Binary backbone chain
All Collatz sequences that terminate at 1 pass through the Binary chain, a sequence of multiples of 4 starting at 16, and skipping every third multiple (with a 4's exponent divisible by 3) :
4 
1 
1 
16 
2 
5 
64 
3 
21 
256 
4 
85 
1024 
5 
341 
4096 
6 
1365 
16384 
7 
5461 
65536 
8 
21845 
262144 
9 
87381 
1048576 
10 
349525 
4194304 
11 
1398101 
16777216 
12 
5592405 
67108864 
13 
22369621 
268435456 
14 
89478485 
1073741824 
15 
357913941 
4294967296 
16 
1431655765 
... et cetera ...
So if the Collatz conjecture is true,
all odd integers (and hence all integers) terminate in this poweroffour backbone chain, and enter it only via one of the odd entry points in the third column.
Indeed, very nearly all integers enter the backbone chain via the bolded odd integer entry points, with the nonbolded integers (divisible by three) as entry points for only those odd integers themselves (and their poweroftwo even multiples).
A program or logic system that computes Collatz sequences need not detect those entry points, just shift the working integer down by one, testing for a 0 lsb and either continuing to the next shift, or testing for 1 and terminating. If these sequential tests are NOT looped, they will be pretty fast, and the many terminating locations in the code will correspond to the terminating number in the backbone chain.
An interesting 64 bit unsigned computation would start with a maximum odd integer (say 2^{56}), then following this loop, decrementing downwards. If the result of the tests exceeds 2^{63}, then pass those integers to another processor that does double or tripleprecision BIGINT computations. That might take about a year to compute on a million MIPS multicore/multithread CPU. Starting with a 2^{64} integer would probably take more than a century.
Only a miniscule fraction of all those starts would enter the backbone chain above the starts in the table shown. My WAG is that a table twice as long would cover ALL 2^{64} bit starts. But that can't be proven without doing the computations, and the general case for much larger integers can't be proven until the Collatz conjecture is proven.
A nested, pipelined, carrylookahead adder can compute an N bit bigint addition in log(N) time. So it is conceivable that a megaabitinteger Collatz processor could still process Collatz cycles at GIPS rates, but the universe would not last long enough for it to test all the starting integers it is capable of processing
A 1000thread custom logic chip made in 10,000 chip batches would be much faster and less power hungry than a general CPU, but very costly to design and implement. It is probably cheaper to pay a lot of mathematicians to fold the problem down to a smaller logical problem. Enough problem foldings (with the aid of automated logic minimization) could lead to a mathematical proof of Collatz ... which only a giant logical system could understand.
Perhaps the proof of the Collatz conjecture would fill more silicon than the universe contains. That would be an awful way for the world to end ...