Bitten by a JRE bug
We have an OLS.Switch implementation where we're forced to do a JNI call to our client's proprietary decryption DLL. This approach is known for its instability (the Wikipedia entry is rife with warnings). Alejandro raised the red flag on it since Day One. His predictions came true: that DLL suffers from some stack management and other issues. Every so often when volume is hot and heavy on a Friday afternoon, the JVM crashes. The logs don't do us any good because what happens is outside OLS.Switch's and jPOS' span of control. All we see in our logs is a five- to 10-second gap.
Fortuitously, our redundant application node strategy minimizes the impact. The content switch fronting the nodes immediately swings traffic to the other node. On the impacted server, we lose transactions in flight at time of the crash, but we close that hole with good follow-up because the origination points generate timeout reversals. There are additional checks and balances on the extract and settlement side to clean up any impact. [NOTE: The focus and concern on these events is always Debit/EBT impact because the 0200 Purchase request is what I call the 'letter of record.' There's no problem on Credit. The affected transactions simply never get placed on the resulting extract files.]
This particular client runs on Windows. We have the OLS.Switch application running as a service. These are set to auto-start. So, when the JVM crashes, the thing pops back up almost immediately. Frankly, this resilience is both good and bad - good because no manual intervention is required and we run on a single node for only a short time; bad because the resilience and 'pop back-ability' of the thing causes people to take their eyes off of a fix of the root cause (do better stack management in the DLL).
Now, the story gets a twist. With Sun EOL-ing JVM 1.4.2 in the fall of 2008, we made a move to upgrade to JRE 1.5.0. Imagine our surprise when we put this into production and after our first DLL-induced crash as described above, OLS.Switch encountered this error when jPOS' Q2 tried to redeploy the TransactionManager upon service restart:
java.lang.NullPointerException
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1820)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
at jdbm.recman.TransactionManager.recover(TransactionManager.java:224)
at jdbm.recman.TransactionManager.<init>(TransactionManager.java:105)
at jdbm.recman.RecordFile.<init>(RecordFile.java:349)
at jdbm.recman.BaseRecordManager.<init>(BaseRecordManager.java:198)
at jdbm.recman.Provider.createRecordManager(Provider.java:108)
at jdbm.RecordManagerFactory.createRecordManager(RecordManagerFactory.java:114)
at org.jpos.space.JDBMSpace.<init>(JDBMSpace.java:67)
at org.jpos.space.JDBMSpace.getSpace(JDBMSpace.java:107)
at org.jpos.space.SpaceFactory.createSpace(SpaceFactory.java:129)
at org.jpos.space.SpaceFactory.getSpace(SpaceFactory.java:113)
at org.jpos.space.SpaceFactory.getSpace(SpaceFactory.java:101)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at bsh.Reflect.invokeMethod(Unknown Source)
at bsh.Reflect.invokeStaticMethod(Unknown Source)
at bsh.Name.invokeMethod(Unknown Source)
at bsh.BSHMethodInvocation.eval(Unknown Source)
at bsh.BSHPrimaryExpression.eval(Unknown Source)
at bsh.BSHPrimaryExpression.eval(Unknown Source)
at bsh.BSHVariableDeclarator.eval(Unknown Source)
at bsh.BSHTypedVariableDeclaration.eval(Unknown Source)
at bsh.Interpreter.run(Unknown Source)
at bsh.Interpreter.main(Unknown Source)
Ouch. This error got the service into an ugly "half-baked" state, which I descibed in some breakdown analysis. Here's what I said:
- The JVM running the ‘ols-switch’ service on APP02 crashed at 15:19:56 on Friday. Cause of the crash was Decrypt.dll. [See the five-second gap in attached snippet on the q2.log. Other than that gap, we have no crash visibility as Decrypt.dll is outside the span of our control.]
- The service is set for auto-restart. See deploy sequence in q2.log starting at “01_capture_date.xml”
- In the log, note the “NullPointerException” when attempting to deploy “30_ev_txnmgr.” This is the program that accepts transactions requests from the server which listens to port 33000 (see next point). It also handles EV, Discount Coupons, Check, Reward Cards + all transactions routed to FDR, AMEX and JCP. [NOTE: All SV-class transactions are forwarded by 30_ev_txnmgr to smaller, separate, independent transaction managers.] As a result of the NPE, this participant did not initiate.
- The deploy sequence continues after the error, including most notably, “50_ev_server.xml”. This is the participant that listens to requests from port 33000 and queues them for sending to 30_ev_txnmgr.
- The deploy sequence completes without a loaded version of 30_ev_txnmgr.
- Transactions begin to be accepted by APP02 (because 50_ev_server is listening to the port).
- Because 30_ev_txnmgr isn’t available to accept messages, 50_ev_server simply deletes any request from its queue older than 60 seconds.
- After 150 seconds (two-and-a-half minutes) of inactivity, OLS.Switch’s Status Manager placed the “OLS.Switch TransManager 02” into a WARN state (yellow light). This warning notes that the transaction manager is not processing transactions.
- As triage, the solution was to stop the ‘ols-switch’ service on APP02. This stopped the ‘listen’ on port 33000, meaning that the load balancer shifted all traffic to APP01.
- APP01 ran the full chain load for the next 45+ minutes.
- We examined the code and realized that one of the JDBM files was corrupted. Specifically, it was one of the STAN files (we use JBDM spaces for SAF and Logon Managers, too). Whenever the JVM is ended unnaturally (as what happened here), there is the risk of this corruption happening.
- 30_ev_txmgr has three JDBM files: amex-stan; fdr-stan; and jcp-stan. These are spaces that keep a counter to assign the next System Trace Audit Number (‘STAN’). Physically each JDBM implementation is two files, a *.lg and *.db file, so there were actually six files in total. We instructed the SysAdmin to rename those six files (he prefixed them all with “100.”). [NOTE: At this point, the SysAdmin also changed the ‘OLS_JAVA’ environment variable on ITCOLSAPP02 to point to 1.4.2_09.]
- The service was restarted. The application recreated the missing files and transactions begin flowing again.
Note the comment that the 'one of the JDBM files was corrupted.' It certainly manifested itself that way in the trace and in subsequent testing. As Alejandro noted: "If there's a JVM corruption things, can be quite unpredictable." We figured we'd just dodged a bullet until this point. We didn't tie the JDBM behavior to the JRE 1.4.2 --> 1.5 upgrade.
We were wrong. Now, the story takes another turn. OLS colleague Dave Bergert tells the story from here:
This step is a follow-up to the "corrupted JDBM" files issue, OLS’ proposed “pre-flight check,” and the installation of Java JRE 1.5 The files were never corrupted, it appeared to be a JVM bug.
OLS and [our client] tested the pre-flight check in production on May 9th using the crash files from April 11th (when we had JRE 1.5.0_15 installed and had a JVM crash). The expected result was for the pre-flight check to attempt to open the JDBM files, discover that they were corrupt and shut down the system. Instead, the result of the exercise was that the switch came up fine with the "corrupted files."
This result led us to discover that using JRE 1.4.2_9 - the "corrupted files" could be opened and used fine -- under JRE 1.5.0_15 - the files could not be opened and the crash check marked them as problem files to be renamed.
We performed further testing to see why the "corrupted files" were able to be opened under JRE 1.4.2_09 and not under JRE 1.5.0_15. OLS tested the following Java JREs on its MS Windows environment test environment to locally duplicate the issue for testing and resolution.
-
JRE 1.4.2_09
-
JRE 1.5.0_15
-
JRE 1.6.0_06
The results were that under JRE 1.4.2_09 and JRE 1.6.0_06 - the "corrupted files" were able to be opened and the switch would work; under JRE 1.5.0_15 the crash check would mark these files as corrupted. It was odd that JRE 1.5.0_15 could not open the files but the other two could.
We did further investigation and discovered a Sun bug report that affected Java JRE versions 1.5.0_08 through 1.6.0_03.
We installed JRE 1.5.0_07 (The latest 1.5.0 release update not affected by this bug). The "corrupted files" were able to be open and the switch would work -- suggesting this bug was the core issue.
The "corrupted files" were never actually corrupted -- it was a SUN Java JRE bug resulting in a behavior in which the OLS.Switch could not application open/read these files under certain circumstances. The bug existed for versions 1.5.0_08 through 1.6.0_03 of the Java JRE.
As part of the JRE investigation (as described above), we took the opportunity to upgrade the incorporated JDBM jar from v0.2 to v1.0. We updated from the project’s CVS source and made an additional fix described here.






