Showing posts with label statistics. Show all posts

Monday, October 28, 2013

Data Visualization - Learn By Looking

So the least rigorous form of statistical analysis is simply looking at the data. I've written about this before: you can tell quite a bit about a phenomenon just by looking at the data (no p-values, no alphas... just looking).

Here was that example distribution of test scores:

When you look at the data and see irregularities or non-smoothness, you're looking at human intervention... some manual action that does not comport with the natural order.

Have a look at this visualization. It's apparently the average monthly premium for insurance plans under the Affordable Care Act, by county. Dark blue marks plans that cost $250/mo; dark red marks plans that cost $1,250/mo... so the more red, the more costly.

What's most interesting to me is that you can see the shapes of Virginia, Wyoming, South Dakota and New Jersey pretty well on this map according to the price of ACA insurance premiums. Some guy living in Montana is paying $500/month; cross an imaginary line into Wyoming and now it's $1,000.

When you see something like this, you can infer that a non-natural phenomenon holds the true explanation (e.g. state law). There's a step-function here, and step-functions aren't found that often in nature.

Now have a look at New England: here, there's a gradient... the farther northeast you go, the more costly the insurance. Likewise in Wisconsin... the closer you get to Minnesota, the more expensive the premium. Gradual change, or smoothness, is what we expect from nature. And a lot of information can be inferred by just looking.
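The eyeball test above can even be mechanized. Here's a minimal sketch, with entirely made-up premium numbers, of the difference between a border step-function and a natural gradient: the largest jump between adjacent points dominates in one case and not the other.

```python
# Sketch: distinguishing a step-function from a gradient by looking at
# adjacent differences. All premium values below are hypothetical.

def max_jump(series):
    """Largest absolute change between adjacent points."""
    return max(abs(b - a) for a, b in zip(series, series[1:]))

# Premiums sampled along a line crossing the Montana/Wyoming border
# (hypothetical numbers): one ~$500 jump at the border dominates.
montana_to_wyoming = [500, 505, 498, 502, 1000, 995, 1005]

# Premiums along a southwest-to-northeast line through New England
# (hypothetical): a smooth gradient, no single large jump.
new_england = [400, 430, 455, 480, 510, 540, 565]

print(max_jump(montana_to_wyoming))  # dominated by one large jump
print(max_jump(new_england))         # small, steady steps
```

A jump far larger than its neighbors suggests a man-made boundary (state law); uniformly small steps suggest a natural gradient.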

Thursday, August 1, 2013

Every MSAT's Response to Process Development

Reducing variability is the only thing the Manufacturing team can control. Ways to do this include getting more accurate probes, improving control algorithms, upgrading procedures, etc.

But there are limits. Probes are only so precise. Transmitters may discretize the signal and add error to the measurement. The cell culture itself may have intrinsic variability.

What makes for releasable lots are cell cultures executed within process specifications. And measuring a process parameter's variability relative to the process specification is the SPC metric: capability.
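The capability metric mentioned above is usually reported as Cpk. Here's a minimal sketch of the calculation, using hypothetical pH readings and a hypothetical 6.8-7.2 spec:

```python
# Sketch of the capability index (Cpk): distance from the process mean
# to the nearest spec limit, in units of 3 standard deviations.
# The pH data and spec limits below are made up for illustration.
from statistics import mean, stdev

def cpk(data, lsl, usl):
    """Process capability: min distance to a spec limit over 3 sigma."""
    mu, sigma = mean(data), stdev(data)
    return min(usl - mu, mu - lsl) / (3 * sigma)

# Hypothetical pH readings against a 6.8-7.2 specification
ph = [7.01, 6.98, 7.03, 7.00, 6.99, 7.02, 7.01, 6.97]
print(round(cpk(ph, lsl=6.8, usl=7.2), 2))  # well above 1.33
```

A common rule of thumb is Cpk >= 1.33 for a capable process; the tight readings against a wide spec here clear that bar easily.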


Process specifications are created by Process Development (PD). At lab-scale, it's their job to run DOEs, explore the process space, and select process specifications narrow enough to produce the right product, but wide enough that any facility can manufacture it.

It's tempting to select the ranges that produce the highest culture volumetric productivity. But that would be a mistake if those specifications were too narrow relative to the process variability. You may get 100% more productivity, but at large-scale only be able to hit those specifications 50% of the time, resulting in a net 0% improvement.
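That trade-off is just expected-value arithmetic. A minimal sketch, with illustrative numbers: doubling productivity is worthless if the narrow spec is only hit half the time.

```python
# The productivity-vs-capability trade-off as arithmetic.
# Multipliers and hit rates below are illustrative, not real data.

def expected_output(productivity_multiplier, in_spec_fraction):
    """Expected output relative to baseline; out-of-spec lots are lost."""
    return productivity_multiplier * in_spec_fraction

baseline = expected_output(1.0, 1.0)    # wide spec: always in spec
aggressive = expected_output(2.0, 0.5)  # +100% titer, 50% hit rate

print(aggressive - baseline)  # net improvement: 0.0
```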

The key is to pick specification limits (USL and LSL) that are wide so that the large-scale process is easy to execute.  And at large-scale, let the MSAT guys find the sweet-spot.

Thursday, July 25, 2013

Fermentation Analysis Software

There's this neat question on the Mathematical Modeling of Fermentation LinkedIn Group on software used in Fermentation.
I would like to ask about the software for the analysis of your fermentation processes. Software for analysis, but not for the fermentation control. Although, if you can say something about the control programs, it is welcome, too.

I suspect that the people in this group deal with small-scale or pilot plant-scale, but this question is actually worth answering for large-scale cell culture/fermentation.

1) What software is used on your fermentation equipment?

In 1999, fermentation control software was basically a free-for-all. No single company had a stranglehold on the market: Allen-Bradley PLCs were popular, Siemens was popular, Honeywell was a good option... But over the last decade, the company that has really taken over the control layer is Emerson, with its DeltaV system.

The reason this is worth talking about is that the data source is the instrument IO monitored by the control software. All analysis is preceded by data capture, archival and retrieval, and DeltaV is the software that does the capture.

Next up is the system that archives this instrument data for the long term. DeltaV has a historian, but the most popular data historian is OSIsoft's PI (a.k.a. OSI PI), because PI has stellar client tools and stellar support. PI client tools like DataLink and ProcessBook are good for generic process troubleshooting and support. More sophisticated analysis requires statistical programs.

Zymergi offers OSI PI consulting for biotech companies.

2) What software you prefer to analyze of your fermentations and for your future fermentation processes planning?

This is where there's a lot of differentiation in fermentation analysis software. My personal fave is SAS Institute's JMP. It's desktop stats software that lets users explore the data and tease signal from noise, or truth from perception. I've solved a ton of problems and produced answers to 7-figure questions with this software.

Zymergi offers MSAT consulting helping customers set up MSAT groups and execute MSAT functions.

There are others operating in this space, but I have yet to see any vendor make headway beyond trial installation and cursory usage.
3) Do you agree with the fact that the question of software for fermentation processes doesn't undergo a rapid development now?
All of these tools are not fermentation specific.  They each are superior in their respective categories:

  • DeltaV is a superior control system
  • OSI PI is a superior data historian
  • JMP is a superior data analysis software
Where there is a gap is fermentation-specific analysis: linking upstream factors to downstream responses.

Friday, April 5, 2013

How To Interpret Distributions (Histograms)

Here's a set of Y-Distributions (histograms) I saw on the data visualization sub-Reddit.

On the left side, we have Polish language scores. On the right, we have mathematics.

Each row is a year... 2010 through 2012.

According to the notes on the page, these are high-school exit exam scores, where passing means receiving 30% of the total available points.

Most people know what a "bell-shaped" curve looks like and those Polish language scores don't look like bells. In fact, it looks like right around the 30% mark, someone took the non-passing scores that were "close enough" and just handed out the passing score.

We sometimes see this in biotech manufacturing... where in order to proceed to the next step, you need to take a sample and measure the result. If there is a specification, you'll see a lot of just-passing results. It's euphemistically called "wishful sampling."

The process is the process and if the sampling is random, you expect a bell-shaped curve. In the case of Polish high school students, their Polish skills are what they are. What you're seeing is an artifact of the people grading the tests. I would bet a fair amount of money that teachers or schools are rewarded according to the number of students who pass this test.
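Here's a minimal simulation of that grading artifact, with made-up numbers rather than the actual Polish data: start from a bell-shaped score distribution, then bump every "close enough" failing score up to the passing mark. The histogram ends up with a hole just below the threshold and a spike right at it.

```python
# Sketch of "wishful grading": simulate true bell-shaped scores, then
# let graders hand the passing mark to anyone within a few points of it.
# All numbers are simulated for illustration, not actual exam data.
import random

random.seed(0)
PASS = 30  # passing mark, in percent of available points

true_scores = [min(100, max(0, int(random.gauss(45, 15))))
               for _ in range(10_000)]

def wishful_grade(score, leniency=5):
    """Graders round every just-failing score up to the passing mark."""
    return PASS if PASS - leniency <= score < PASS else score

reported = [wishful_grade(s) for s in true_scores]

just_below = sum(PASS - 5 <= s < PASS for s in reported)  # the "hole"
at_pass = sum(s == PASS for s in reported)                # the spike
print(just_below, at_pass)
```

Plot `reported` as a histogram and you get exactly the shape of those Polish language scores: smooth bell, empty band, tall bar at 30%.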

Let's look at the mathematics scores. This "wishful grading" is going on in mathematics too, but it's far less pronounced. What's crazy is how different the distributions look from year to year (compared to the language histograms).

It's hard for me to believe that the mathematics skills of students across Poland vary that much from year to year. As with the U.S. News & World Report rankings of schools, it's more likely that the difficulty of the test changes significantly from year to year... in this case, with the 2011 test having particularly difficult questions.

Histograms say quite a bit about your process. They also say quite a bit about your process specifications and how truthful your measurement systems are.

If I were the FDA... and I wanted to be mean about it, I'd request a distribution of measurements for every single process specification, and if I saw something like this "Polish language" test, someone would have some explaining to do.

Get Biotech Manufacturing Consulting

Monday, April 1, 2013

Dr. Tom Little - Stats Ninja

There is an epidemic of statistical dunderheads working in the biotech industry. This epidemic probably plagues other sectors of the economy as well, but I'm not qualified to speak to that.

The reason for this complete lack of statistical knowledge (I think) is that statistics is not a part of the standard engineering curriculum. You get differential and integral calculus like crazy, but just one semester of basic engineering statistics and here's your diploma.

And as with most of undergraduate academia, it's not practical.

At my first job, we used the statistical software program JMP a lot. We were making a minimally invasive glucose monitor called the GlucoWatch® Biographer, and my entire job as a research engineer was to run in-house clinical studies and correlate the Biographer's performance against over-the-counter glucose meters. We did a lot of linear correlations, and I got to understand what p-values meant. And I figured out that the primary purpose of engineering a system is to figure out what is signal and what is noise.

I think I might have even landed my second job because I knew how to use JMP. In fact, my second week on the job, the boss had his entire group go get JMP training in San Francisco where I had the luck of sharing a computer terminal with him.

Whatever the case, understanding enough statistics to know which tests are applicable when is really important. And when your group gets big enough that sending your team to off-site training becomes impractical, there is Dr. Thomas Little, who will send practical stats gurus to train you.

Dr. Tom Little Statistics Consulting

Dr. Tom trained us in a computer-room setting, and a lot of this stuff was new at the time I learned it: ANOVA... multivariate analysis... why to use backwards stepwise regression... how to read a normal quantile plot... capability... control charting... all the things that are relevant to monitoring a production campaign.

When you get out of the class, you've leveled up in the world of biologics manufacturing and you look around and wonder why maintaining spreadsheets of cell culture data qualifies as plant support. You also start wondering why process development spends more time swinging male genitalia over higher titers rather than defining critical process parameters (CPPs) and identifying proven acceptable ranges (PARs).

Dr. Tom is pretty well-known in the world of biologics. I run into his team of consultants every third place I go. If your team isn't making statistically-sound, data-driven decisions, you seriously need to give him a call.

Call Dr. Tom


Tuesday, February 5, 2013

Ice Cream causes Swimming Pool Deaths!

I see this proverbial "Ice cream causes swimming pool drownings" statement made in the world of economics and politics all the time.

It's so prevalent that there's a Wikipedia article on spurious relationships.
[Ice cream] sales are highest when the rate of drownings in city swimming pools is highest.
You can look the data over and you'll see that this phenomenon happens like clockwork:
  • Low ice cream sales... fewer swimming pool deaths.
  • High ice cream sales... many swimming pool deaths.
So there's a correlation, right? Yes.

With that correlation, some go further and allege that ice cream causes drownings, or that drownings cause ice cream sales. (Ahem, no.)

To claim that ice cream sales are an indicator of drownings (or vice versa) also misses the point, because ice cream sales and swimming pool deaths are both results of an underlying factor: a heat wave.

Unfortunately, this statement of two symptoms indicating one another is seen all the time in the world of cell culture analysis:
  • Final ammonium (NH4+) is an indicator of culture performance
    - or -
  • Final lactate (Lac) is an indicator of product titer

(image credit: The Usual Suspects, MGM)

Seriously, who here doesn't already know that cell growth impacts culture performance?  Or that cell metabolism impacts culture performance?

Yet we are still publishing papers on how final lactate is an indicator of product titer and concluding that cell metabolism impacts culture performance.

Final ammonium and final lactate are symptoms of the cell culture metabolic conditions that produce higher titers.

Unless you can recommend specific changes the Production group can execute to improve culture conditions, such as:
  • Changing media components
  • Changing a parameter setpoint (pH, temp, dO2)
  • Changing the timing of culture operations (temp shift, pH shift, timing of feeds...)
you've simply uncovered a spurious relationship; there remains no action you can take to improve culture performance.

This is why it is best to start your multivariate analysis by picking actionable parameters to ensure that you have true factors.

When you pick actionable parameters to model as factors in your multivariate analysis, you have a shot at gaining control of an out-of-control campaign and meeting your Adherence-to-Plan, as Rob Johnson did.

If you're happy pontificating from ivory towers, keep making true-but-useless statements on how every time Y1 happens that Y2 also happens.