Monday, September 2, 2019

Changing Orders of Magnitude

As my little startup grows I’ve had some time to think about what sort of scaling I like. I think of orders of magnitude as a good indicator of the challenges. I'm not talking about number of employees, which is what my coworkers all assumed when I bounced this concept off of them, but rather number of products made. So to get into it:

First is zero to one. In many respects that is the hardest order of magnitude because you go from nothing to something. Advantages: you don’t have to worry about tolerances, because as long as it fits it works, and if it doesn't, you pull out the die grinder. Disadvantages: performance might be all over the map, it will probably take a long time to make, and it might fail quickly, like the Wright brothers' first airplane, which covered 120 feet on its first flight and was wrecked after its fourth.

At this stage even things I consider engineering and science are really in large part an art project. At my old company we called these builds "mules" because they were cobbled together to test a very specific set of variables. Typically only one was built, although sometimes two were. While we then qualified and warrantied our eventual products to 8,000 hours, a mule might only see 100-200 hours of operation to confirm that some big technical issue worked. 

A fun story, although it was stressful at the time: back in the fall of 2012, I believe, I was part of a big project, a $150 million program, as a very junior engineer. Prior to my joining, two mules had been made to test a new transmission, one for vehicle type A and one for vehicle type B. Vehicle type A was a success and the transmission worked well, but in vehicle type B the same transmission just did not have the shift timing required. So about six months before we were planned to build 24 prototype machines, we made a program change to a different transmission. It was a big change that required a lot of design updates, analysis, and tooling changes. It was the right decision and shows the value of "failing" when you make a sample of one.

The next step is from one to ten. Suddenly tolerances start to matter, but you can still pull out a die grinder and make it fit if needed. There are small economies of scale. Amazingly, the per-part price can drop about 25% between ordering one part and ordering two. In other words, to buy just one might cost $100, but to buy two might cost $150 total, and three might be $190.
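To make that arithmetic concrete, here is a minimal sketch in Python. The totals are just the illustrative numbers from the paragraph above, not real quotes, and the function names are made up for this example.

```python
# Illustrative order totals from the text: quantity -> total price in dollars.
# These are example numbers, not real supplier quotes.
quotes = {1: 100.0, 2: 150.0, 3: 190.0}


def unit_prices(totals):
    """Return the price per part for each order quantity."""
    return {qty: total / qty for qty, total in totals.items()}


def unit_price_drop(totals, qty_a, qty_b):
    """Fractional drop in per-part price going from qty_a to qty_b."""
    per_part = unit_prices(totals)
    return 1.0 - per_part[qty_b] / per_part[qty_a]


if __name__ == "__main__":
    for qty, price in unit_prices(quotes).items():
        print(f"qty {qty}: ${price:.2f} per part")
    print(f"drop from 1 to 2 parts: {unit_price_drop(quotes, 1, 2):.0%}")
```

With these numbers the per-part price goes from $100 to $75 at two parts, which is where the 25% comes from.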

On the same program I mentioned earlier, in November of 2013 we had the 24 prototype machines, and I think six were in the field collecting hours. Suddenly we had a failure that we had not seen before. It turns out I had failed to account for a side load in my FEA the year before. So in the space of two days I confirmed the failure in FEA with the new load case and we made some steel plates, and on the third day I flew to Canada to oversee the weld repairs to fix the machine. We ended up having to rework basically all the machines with that particular feature, about 12 in all.

At this point you can joke about being in production, although even if you try to keep things the same there will be inevitable differences, which might amount to several percent of performance. 

From 10 to 100. This is my favorite order of magnitude to transition across. Tolerances really matter at this point. If you screw it up and have to rework 100 parts, suddenly that’s a big deal. You get sizable discounts on machined parts. Castings and forgings are possible and may be financially justified at this scale depending on the application. At this stage problems are more expensive. 

I like this transition because in addition to the problems typically being more complex (like tolerance stack-ups with 15 parts) there is a level of consistency and volume that gives it a feeling of a real business. A big upside is that certain things become routine, as in every product achieves a basic level of performance. 
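For a feel of why a 15-part stack-up gets complicated, here is a rough sketch in Python comparing a worst-case stack to a root-sum-square (RSS) stack. The ±0.5 mm tolerance on every part is an assumption chosen only for illustration, and RSS itself assumes the part variations are independent and roughly centered, which real parts may not be.

```python
import math


def worst_case_stack(tolerances):
    """Worst-case stack: every part sits at its tolerance limit at once."""
    return sum(abs(t) for t in tolerances)


def rss_stack(tolerances):
    """Root-sum-square stack: assumes independent, centered variations."""
    return math.sqrt(sum(t * t for t in tolerances))


# Assumption for illustration: 15 parts, each toleranced at +/-0.5 mm.
tolerances_mm = [0.5] * 15

if __name__ == "__main__":
    print(f"worst case: +/-{worst_case_stack(tolerances_mm):.2f} mm")
    print(f"RSS:        +/-{rss_stack(tolerances_mm):.2f} mm")
```

The gap between the two numbers is the judgment call: design to worst case and every part gets expensive, design to RSS and a few assemblies will not fit.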

As an example, in the winter of 2014-2015 that $150 million program went to production. I was responsible for a very large welded structure. I reviewed the dimensional inspection reports for each of the first 20 structures, and we were seeing 100+ points out of tolerance. I went through and listed the things that were unacceptable and the things that were acceptable in a four-page email. One of the dimensions was the overall length of the structure. It was up to 12 mm too short. I didn't know if that was bad or good. It turns out the tolerance stack-up for the other parts that mounted to this structure did not allow them to fit correctly. We ended up grinding hundreds of holes larger to make the parts fit, until we realized that adjusting the welding fixture on the large structure would solve the problem.

100 to 1000. At this point you are probably outsourcing some operations to companies that will be doing them full time. Companies that make one of your components will make that component during all working hours because the volumes are that high. At this point mistakes become quite expensive. My most expensive mistake was failing to correct an interference between two parts that ended up causing over $500,000 in warranty claims in 2015 and 2016 before the design and tooling were updated in late 2015.

At this stage, things start to happen fast. Mistakes are propagated uncomfortably fast. Making a change after 100+ have been made the old way is hard and expensive. There is still an air of improvisation: most product lines will have individual products that deviate from perfect and dimensions that are out of tolerance (unless the designers are amazing). Production is low enough that if you have to stop the assembly line for a day or a week or even a month, it's not the end of the world. Most likely profit margins are still high enough to tolerate that kind of pause.

1000 to 10,000. This is where automation starts to become a bigger factor. Getting programs set up and running without constant stoppages takes a longer ramp-up. At lower levels of manufacturing humans can basically assemble it all, but at this point there will likely be fixtures and keep-out cages to keep people from interfering with the robot operation.

At this point there will be many continuous improvement changes that at a lower level of production might have been recalls or reworks for all of the parts built. This is why most engineers with some experience don't like to buy the first model year of a new vehicle. There are dozens of changes that need to get ironed out that you don't even know about until you get to a higher volume. Unfortunately sheet metal and plastic parts are often among the issues. A skilled assembler will be able to make the part fit by loosening and tightening various mating parts, but then a new person will try it and nothing will fit. Inevitably the design will need to be changed to make it more robust. Similarly, long-term drivetrain issues often start to appear at this magnitude. Gears and bearings are often robust enough to last thousands of hours, but minuscule misalignment will often destroy them. In other words, this is the reliability of vehicles in the 1960s and 1970s.

10,000 to 100,000. This is the limit of my experience. I’ve never worked on a part with over 60,000 used per year. Stamping is huge at this point, because it’s super fast and repeatable. Mistakes are different at this stage. Programs are typically large enough, with enough testing, that the really low-hour gross errors don't happen anymore. Improvements are typically based on warranty claims at this point. It may be millions of dollars in warranty claims over years, but it can take a while for those to show up. They are often due to specific conditions. The largest warranty issue I worked on had something like $2 million in warranty claims over seven years, and it took us more than half a year to figure out how to replicate it in the lab. How to go about fixing it was quite debatable.

At this level statistics becomes far more important. At lower levels of production when there is an issue, you just go fix it. You might have to parallel path multiple different fixes if you aren't totally sure what will do it, but at this level there are random failures that statistically are not worth the time and effort to go and fix. For example, I have only ever heard of one belay loop breaking on a harness. The climber was famous, so it's a somewhat well-known event. As a result the climbing companies did some testing and determined it was so rare that no big changes were needed. In the automotive world, this is basically the range of statistical confidence that manufacturers get into when they are doing their prototype testing. It becomes much more expensive to fix these "little" issues, or even find them, when testing 100 prototype cars.
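To put a rough number on what 100 prototypes can tell you, here is a small sketch in Python of the standard zero-failure bound. It assumes every prototype is an independent pass/fail trial with the same failure probability, which is a simplification, and the function name is made up for this example.

```python
def zero_failure_upper_bound(n_units, confidence=0.95):
    """Upper bound on the failure probability after n_units with zero failures.

    Assumes independent pass/fail trials with one shared failure probability.
    Solves (1 - p) ** n_units = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_units)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        bound = zero_failure_upper_bound(n)
        print(f"{n:>5} units, no failures: rate could still be up to {bound:.2%}")
```

Under those assumptions, 100 clean prototypes only rule out failure rates above roughly 3% at 95% confidence (the common "rule of three" shortcut, 3 divided by the sample size, gives about the same answer). Anything rarer than that can slip through untouched.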

From 100,000 to 1 million is beyond anything I have worked on. However, I am familiar with a couple of stories. When bolts come from a bad steel lot, for example, or the base steel material is defective, it can be nearly impossible to determine when the problem began. When you see huge automotive recalls, it is because they are getting into corner cases of corner cases. The Takata airbag saga is still another level or two beyond this, but it is illustrative of the issues at this level. Somewhere around 20+ people died and 250+ were injured worldwide, and 45 million affected vehicles had those airbags. Finding that kind of error in a typical testing program is not guaranteed. I'm sure after one or two people died the engineers and managers scratched their heads and wondered: is it a bad design or a random occurrence? Kind of like Teslas having crashes while on Autopilot. It's hard to say with much confidence from such a small sample size what the real issue is. People slip and fall and die in the shower all the time, but we don't quit taking showers.
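Here is a back-of-the-envelope sketch in Python of why a test program would likely miss something like that. It takes the rough figures above (about 270 deaths plus injuries across about 45 million vehicles) as a crude per-vehicle rate and treats every vehicle as an independent trial. Both are assumptions for illustration only; they ignore time in service and operating conditions, so this is not an analysis of the actual recall.

```python
import math


def chance_of_seeing_event(event_rate, n_units):
    """Probability of at least one event among n_units independent units."""
    # log1p/expm1 keep precision when event_rate is tiny.
    return -math.expm1(n_units * math.log1p(-event_rate))


# Crude per-vehicle rate built from the rough figures in the text (assumption).
rate = 270 / 45_000_000

if __name__ == "__main__":
    for fleet in (100, 10_000, 1_000_000):
        chance = chance_of_seeing_event(rate, fleet)
        print(f"{fleet:>9} vehicles: {chance:.4%} chance of seeing at least one event")
```

Under those assumptions a 100-vehicle test fleet has well under a 0.1% chance of ever showing the problem, and it only becomes close to certain somewhere around a million vehicles.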

In short, while I really like statistics, these huge orders of magnitude do not really interest me because the problems can be so complex, with so many variables, that a solution can easily be worse than the original problem, even with additional testing. That doesn't mean we shouldn't try for improvement, rather that the reality is we will never fully achieve it. In other words, there will always be engineering jobs, even if something happened to reset the world back to zero and we had to start over again.
