« April 2008 | Main | June 2008 »

11 posts from May 2008

Tuesday, May 27, 2008

Bitten by a JRE bug

We have an OLS.Switch implementation where we're forced to do a JNI call to our client's proprietary decryption DLL.  This approach is known for its instability (the Wikipedia entry is rife with warnings).  Alejandro raised the red flag on it since Day One.  His predictions came true: that DLL suffers from some stack management and other issues.  Every so often when volume is hot and heavy on a Friday afternoon, the JVM crashes.  The logs don't do us any good because what happens is outside OLS.Switch's and jPOS' span of control.  All we see in our logs is a five- to 10-second gap.

Fortuitously, our redundant application node strategy minimizes the impact.  The content switch fronting the nodes immediately swings traffic to the other node.  On the impacted server, we lose transactions in flight at time of the crash, but we close that hole with good follow-up because the origination points generate timeout reversals.  There are additional checks and balances on the extract and settlement side to clean up any impact.  [NOTE: The focus and concern on these events is always Debit/EBT impact because the 0200 Purchase request is what I call the 'letter of record.'  There's no problem on Credit.  The affected transactions simply never get placed on the resulting extract files.] 

This particular client runs on Windows.  We have the OLS.Switch application running as a service.  These are set to auto-start.  So, when the JVM crashes, the thing pops back up almost immediately.  Frankly, this resilience is both good and bad - good because no manual intervention is required and we run on a single node for only a short time; bad because the resilience and 'pop back-ability' of the thing causes people to take their eyes off of a fix of the root cause (do better stack management in the DLL).

Now, the story gets a twist.  With Sun EOL-ing JVM 1.4.2 in the fall of 2008, we made a move to upgrade to JRE 1.5.0.  Imagine our surprise when we put this into production and after our first DLL-induced crash as described above, OLS.Switch encountered this error when jPOS' Q2 tried to redeploy the TransactionManager upon service restart:

java.lang.NullPointerException
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1820)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
        at jdbm.recman.TransactionManager.recover(TransactionManager.java:224)
        at jdbm.recman.TransactionManager.<init>(TransactionManager.java:105)
        at jdbm.recman.RecordFile.<init>(RecordFile.java:349)
        at jdbm.recman.BaseRecordManager.<init>(BaseRecordManager.java:198)
        at jdbm.recman.Provider.createRecordManager(Provider.java:108)
        at jdbm.RecordManagerFactory.createRecordManager(RecordManagerFactory.java:114)
        at org.jpos.space.JDBMSpace.<init>(JDBMSpace.java:67)
        at org.jpos.space.JDBMSpace.getSpace(JDBMSpace.java:107)
        at org.jpos.space.SpaceFactory.createSpace(SpaceFactory.java:129)
        at org.jpos.space.SpaceFactory.getSpace(SpaceFactory.java:113)
        at org.jpos.space.SpaceFactory.getSpace(SpaceFactory.java:101)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at bsh.Reflect.invokeMethod(Unknown Source)
        at bsh.Reflect.invokeStaticMethod(Unknown Source)
        at bsh.Name.invokeMethod(Unknown Source)
        at bsh.BSHMethodInvocation.eval(Unknown Source)
        at bsh.BSHPrimaryExpression.eval(Unknown Source)
        at bsh.BSHPrimaryExpression.eval(Unknown Source)
        at bsh.BSHVariableDeclarator.eval(Unknown Source)
        at bsh.BSHTypedVariableDeclaration.eval(Unknown Source)
        at bsh.Interpreter.run(Unknown Source)
        at bsh.Interpreter.main(Unknown Source)

Ouch.  This error got the service into an ugly "half-baked" state, which I descibed in some breakdown analysis.  Here's what I said:

  1. The JVM running the ‘ols-switch’ service on APP02 crashed at 15:19:56 on Friday.  Cause of the crash was Decrypt.dll.  [See the five-second gap in attached snippet on the q2.log.  Other than that gap, we have no crash visibility as Decrypt.dll is outside the span of our control.]
  2. The service is set for auto-restart.  See deploy sequence in q2.log starting at “01_capture_date.xml”
  3. In the log, note the “NullPointerException” when attempting to deploy “30_ev_txnmgr.”  This is the program that accepts transactions requests from the server which listens to port 33000 (see next point).  It also handles EV, Discount Coupons, Check, Reward Cards + all transactions routed to FDR, AMEX and JCP.  [NOTE:  All SV-class transactions are forwarded by 30_ev_txnmgr to smaller, separate, independent transaction managers.]  As a result of the NPE, this participant did not initiate. 
  4. The deploy sequence continues after the error, including most notably, “50_ev_server.xml”.  This is the participant that listens to requests from port 33000 and queues them for sending to 30_ev_txnmgr.
  5. The deploy sequence completes without a loaded version of 30_ev_txnmgr.
  6. Transactions begin to be accepted by APP02 (because 50_ev_server is listening to the port).
  7. Because 30_ev_txnmgr isn’t available to accept messages, 50_ev_server simply deletes any request from its queue older than 60 seconds.
  8. After 150 seconds (two-and-a-half minutes) of inactivity, OLS.Switch’s Status Manager placed the “OLS.Switch TransManager 02” into a WARN state (yellow light).  This warning notes that the transaction manager is not processing transactions.
  9. As triage, the solution was to stop the ‘ols-switch’ service on APP02.  This stopped the ‘listen’ on port 33000, meaning that the load balancer shifted all traffic to APP01.
  10. APP01 ran the full chain load for the next 45+ minutes.
  11. We examined the code and realized that one of the JDBM files was corrupted.  Specifically, it was one of the STAN files (we use JBDM spaces for SAF and Logon Managers, too).  Whenever the JVM is ended unnaturally (as what happened here), there is the risk of this corruption happening. 
  12. 30_ev_txmgr has three JDBM files: amex-stan; fdr-stan; and jcp-stan.  These are spaces that keep a counter to assign the next System Trace Audit Number (‘STAN’).  Physically each JDBM implementation is two files, a *.lg and *.db file, so there were actually six files in total.  We instructed the SysAdmin to rename those six files (he prefixed them all with “100.”).  [NOTE:  At this point, the SysAdmin also changed the ‘OLS_JAVA’ environment variable on ITCOLSAPP02 to point to 1.4.2_09.]
  13. The service was restarted.  The application recreated the missing files and transactions begin flowing again.

Note the comment that the 'one of the JDBM files was corrupted.'  It certainly manifested itself that way in the trace and in subsequent testing.  As Alejandro noted:  "If there's a JVM corruption things, can be quite unpredictable."  We figured we'd just dodged a bullet until this point.  We didn't tie the JDBM behavior to the JRE 1.4.2 --> 1.5 upgrade.

We were wrong.  Now, the story takes another turn. 
OLS colleague Dave Bergert tells the story from here:

This step is a follow-up to the "corrupted JDBM" files issue, OLS’ proposed “pre-flight check,” and the installation of Java JRE 1.5  The files were never corrupted, it appeared to be a JVM bug.

OLS and [our client] tested the pre-flight check in production on May 9th using the crash files from April 11th (when we had JRE 1.5.0_15 installed and had a JVM crash).  The expected result was for the pre-flight check to attempt to open the JDBM files, discover that they were corrupt and shut down the system.  Instead, the result of the exercise was that the switch came up fine with the "corrupted files."

This result led us to discover that using JRE 1.4.2_9 - the "corrupted files" could be opened and used fine -- under JRE 1.5.0_15 - the files could not be opened and the crash check marked them as problem files to be renamed.

We performed further testing to see why the "corrupted files" were able to be opened under JRE 1.4.2_09 and not under JRE 1.5.0_15. OLS tested the following Java JREs on its MS Windows environment test environment to locally duplicate the issue for testing and resolution.

  • JRE 1.4.2_09

  • JRE 1.5.0_15

  • JRE 1.6.0_06

The results were that under JRE 1.4.2_09 and JRE 1.6.0_06 - the "corrupted files" were able to be opened and the switch would work; under JRE 1.5.0_15 the crash check would mark these files as corrupted.  It was odd that JRE 1.5.0_15 could not open the files but the other two could.

We did further investigation and discovered a Sun bug report that affected Java JRE versions 1.5.0_08 through 1.6.0_03.

We installed JRE 1.5.0_07 (The latest 1.5.0 release update not affected by this bug).  The "corrupted files" were able to be open and the switch would work -- suggesting this bug was the core issue. 

The "corrupted files" were never actually corrupted -- it was a SUN Java JRE bug resulting in a behavior in which the OLS.Switch could not application open/read these files under certain circumstances.  The bug existed for versions 1.5.0_08 through 1.6.0_03 of the Java JRE.

As part of the JRE investigation (as described above), we took the opportunity to upgrade the incorporated JDBM jar from v0.2 to v1.0.  We updated from the project’s CVS source and made an additional fix described here.

Monday, May 26, 2008

The Field Behavior of Transactions

Every transaction in a payment switch environment has a series of behaviors that go along with it.  In OLS.Switch, these behaviors are the focus of implementation of jPOS' TransactionManager and its participants.  A lot of factors come into play here:

  • Are we switching out the transaction for remote auth or handling things locally?
  • Can the originating device reverse the transaction?
  • Can the originating device void the transaction?
  • Is a duplicate transaction possible?
  • Is a Manager Override version of the transaction possible?
  • If the request for remote auth times out, are we obligated to send a reversal?

The answer to these questions can vary by card type, card brand and endpoint. 

For one customer, I synthesized all these behaviors into a one-page summary.  Take note that the originating point in this example uses a Visa Gen 2 message set, so the G.93 designations (the Debit Sale - listed here simply as an example) are fairly esoteric.  I believe it still gets the point across in a clear manner. 

Having this one-pager in hand allows anyone to summarize the operational model quickly.  For example, you can see that all 'Merchandise Return' transactions are processed locally, except the JC Penney ('JCP') version (which is switched out for auth) and the EBT Food version (where a timed-out external auth request is reversed via an entry placed in that endpoint's SAF queue).

0100 or 0200?

Ahmed El-Malatawy, a reader of this blog, has asked a series of good questions related to decisions he needs to make concerning his jPOS implementation.  Here's our latest exchange...

Ahmed asks:

I need to implement a purchase functionality (in general debit money from a specific account) from my application that uses JPOS as an ISO 8583 protocol implementation. The question is what is the message class that I can use: Authorization 1xx or Financial 2xx.  In other words, what is the suitable message class for the operations of debits, credits and money transfer between accounts - Authorization, Financial or Both?  I have a problem in differentiating between the 2 classes Authorization & Financial.

It's really a nice question because the answer is not obvious.

Here's my answer...

----------

Ahmed –

There’s no definitive right and wrong, but typically in practice the difference is typically drawn like this:

  • If the transaction is to have an actual effect on the customer’s account, then you use an 0200.*
  • If the transaction is instead to put a *hold* on funds in the person’s account in anticipation of a settlement record to come later, then you use an 0100.

As a real-life example, we ‘talk’ to FDR North.  If we do a credit card or offline debit transaction, we send an 0100 authorization.  As long as that transaction is not subsequently reversed, then we’re obligated to send a corresponding settlement record later that night.  It’s that piece of information – not the preceding authorization – that has the real effect on the customer’s account.  Conversely, an online Debit is what I call the ‘letter of record.’  It’s that transaction that affects the account , so it's implemented as an 0200.

Note that some gateways like FDR North may ask for Online Debit transactions to appear in the settlement file that night, but it’s more for institution-to-institution accounting purposes, rather than to affect individual cardholder balances.

For some better definitions of my usage here of the terms ‘credit,’ ‘offline debit’ and ‘online debit,’ take a look at my “Credit vs. Debit – Part 2” post.

----------

Just to give you an idea of how loosely this standard is applied, here's a quick 0100/0200 tally of various implementations we've got running:

  • FDR North (Credit)- 0100
  • FDR North (Offline Debit) - 0100
  • FDR North (Online Debit) - 0200*
  • FDR Omaha (Credit) - 0100
  • American Express (Credit) - 1100 (AMEX uses the 1993 version of ISO 8583)
  • Discover (Credit) - 0100
  • SVS (Pre-paid Balance Inquiry) - 0100
  • SVS (Pre-paid Refresh, Purchase, Return, Deactivate) - 0100
  • SVS (Pre-paid Activation) - 0300
  • GreenDot (Pre-paid Activation, Deactivation) - 0100
  • Incomm (Pre-paid Activation, Refresh, Deactivation) - 0200

*The simplifying caveat here is that I'm not delving into the whole Pre-auth/Completion scenario in this post.

Saturday, May 24, 2008

Replicating your issuer-side jPOS implementation in Country #2

I blog here mostly about all the development, operational and new initiative activity going on on the acquirer-side of OLS.Switch (our jPOS-based payment switch), but we've got an issuer side solution we created as well.  You don't hear about that much here because, by nature, there are not as many new initiatives, volume is lower (just a quirk of our client set mix) and there's only a single payment gateway in play in the solution environment.  Compare that last point with a typical acquirer-side solution.  At our flagship client, for example, we've got 32 endpoints in play. 

The reason I bring this up is that an issuer client of ours has approached us about replicating our in-place solution for an operation they want to bring online in another country.  I was asked to put together a list of items that would need analysis and probable change.  I made the assumption (confirmed, in this case) that the payment gateway we're using  here in the US instance won't be a viable option for Country #2 and beyond.  With that fact in hand, here's the list of touchpoints I had tallied:

  1. We need copies of all related specifications describing the interface that will perform the role that the current gateway provider now plays in conjunction with the US implementation.
  2. In particular, we need four pieces of information: the authorization message set and field descriptions; the telecommunications guide; the encryption guide; and the certification script.  Optimally, these documents should be provided to us in English.
  3. After we have time to review the documents, we need a formal walkthrough in the form of a teleconference with the gateway provider.
  4. We need to know details about the reversal model; specifically, how does the reversal get matched-up to the original?  (each gateway provider has a distinct match-up methodology).
  5. Similarly, we need to know details about the pre-auth/completion; specifically, how the completion get matched to the pre-auth? (similarly, each gateway provider has a distinct match-up methodology)
  6. We need to understand the testing facilities that will be made available to us, how to coordinate testing and what will be required for certification.
  7. We need to know if we need to have any specific language capabilities on our side to perform testing and certification.
  8. We also need to know whether any currency conversion issues will be in play.

Items 4 and 5 - how to match the reversal to the original; how to match the completion to the pre-auth - are worthy of a closer look.  I mention above that "each gateway provider has a distinct match-up methodology."  I have a write-up in my files that delves into the way that one such gateway provider does it.  What we, in turn, have to do is described here.  I think is pretty good one-page summary of things. 

Friday, May 23, 2008

Dave's Productivity

Daves_workspace Now I know why my OLS colleague Dave Bergert is so productive: the quad-core processor in his brain is hovering over all those machines dividing up his work cycles so that he's coding transaction simulators on one device, fixing production problems on another, chatting with me on Skype on a third, setting up PABP Audit environments on a fourth, doing jPOS commits on a fifth...

jPOS Created Here

Aprs_desktop_2 That's the desktop where the world's best payment systems infrastructure gets created.  I'm grateful to play a role in the world Alejandro has created.

In other news, it looks like we'll top 5,000 store locations today - or, by latest, next Tuesday - at our flagship OLS.Switch installation.  It's been a long time getting those 1,800 new stores on-boarded, but our client got some serious momentum over the last couple of months.  The last store gets converted on Friday, May 30th.

Sunday, May 18, 2008

Mother's Day Follow-up

Mothersdayweek2008A quick follow-up to my earlier Mother's Day post...our jPOS-enabled payment switch (acquirer side) continued to see big volume increases over Mother's Day weekend.  We topped a million on Saturday on OLS.Switch and, more notably, almost 800,000 on the big day itself.  If you compare that number to the previous Sunday, you can  make a reasonable assumption that about 150,000 panicky sons and daughters across this fine country piled into stores desperately seeking that last minute token of gratitude to stay in Mom's good graces for one more year. 

That 914,077 average for the week is an all-time high.

Saturday, May 10, 2008

Implementing MethCheck (New Initiatives, Part 2)

One of the gratifying things about the maturation of the OLS.Switch payment switch solution is that our clients have the confidence to come to us with challenging new point-of-sale payment initiatives.  Recently, we had one come our way that was quite unique.  It's called MethCheck, a new service from Appriss, Inc.  As you can tell from the service's name, this new initiative involves tracking the sale of Pseudophedrine.  This particular client is a large pharmacy chain, and they're being mandated by the state of Kentucky (surely only the first state of many to insist on this type of service) to implement it.  Here are some of Appriss' key talking points about the service:

  • A single point of contact for managing compliance, ensuring pharmacies are submitting all required data to law enforcement.
  • Tracks PSE purchase limits, any aggregate limits required by the law, box limits, pill counts, and acceptable forms of identification.
  • Multi-state Compliance Manager (MSCM) keeps up to date with new PSE legislation.
  • Communicates with state electronic PSE repositories allowing pharmacies to stay compliant without maintaining multiple interfaces.

Methcheck Here's a nice diagram (see pop-up at left) from the Appriss site describing their solution.  OLS.Switch in this setting is just a piece of the orange box (as marked-up by me) in the upper-right corner.

My colleague Dave Bergert has an excellent post on his blog describing the nitty-gritty details of how we got this very non-standard message set flowing through our application.  Most notably, he relates how we used jPOS' FSDMsg class to get the job done in less than a week (see here for a post I did about Extracts - part of of my 'Real Systems Do Extracts' series that ends up spotlighting FSD). 

As a payment systems manager, the key takeaway here is that by implementing a jPOS-based solution, you become the go-to-guy (or gal) in your organization for anyone envisioning innovative uses of your payment systems infrastructure.  MethCheck is far afield from Debit/EBT, for example, but we achieved co-existence without jeopardizing what's already in place.  This is good news for you because we know the frustrations of those of you managing legacy payment systems.  You tell us it always seems to be "six months and major bucks" from your vendor for any initiative of this scale.  And for those of you outsourcing, well, best of luck getting some attention.

350,000,000 and counting

Churchsign_4 Most days, we've got heads down, doing the continuous improvement thing on our jPOS implementations.  Today, I lifted my head long enough to notice that we just hit row ID 350,000,000 on our tranlog at a large OLS.Switch acquirer-side payment switch implementation.  That's not to say we've got that many rows loaded in there as we're pretty aggressive about our tranlog culling practices, but it is a  good "from Day One" counting proxy. 

As a complement to that feat, I mentioned to Alejandro earlier this week that the two Subversions that feed this particular implementation are sitting at a combined r3300 right now.  For anyone thinking the work tails off at some point, I can confidently state that it actually increases.  In fact, with our new payment initiatives, the speed of SVN check-ins right now is as rapid as it has ever been.  We take this pace of activity as a vote of confidence from our client.

Add Mother's Day to the pantheon of buying bulges

Mothers_day_2008_runup So, I was talking to my wife and mentioning that with our flagship OLS.Switch payment switch implementation closing in on 5,000 store locations (4,865 as of yesterday), we would soon see our first 'natural' 1,000,000 transaction day and that yesterday may come close.  To which she responded:  "Umm, Mother's Day anyone?" 

Oh, yeah.  That.

We serviced 964,582 separate customer interactions yesterday, as consumers across the four US Continental times streamed in to buy cards and other testamonials to their Moms.  This buying bulge matches previous waves we've seen pre-Easter, Valentine's Day and most notably on December 24th.

It's steady as she goes on the jPOS performance front - our two replicated application nodes sit at about a 50 MB Memory footprint, at peak we're at about 20% CPU on $7,000 Quad Core servers, and our optimized MS SQL Server virtual cluster barely tickles the CPU charts (about 2% - 5% or so).

Here's Friday, May 9th in depth.  We took a new application (internal Check Authorization) live this week, so followers of these charts will note some new rows as we start to roll out this new service.  It's deployed in one pilot location.

Saturday, May 03, 2008

The New Normal (continued)

0502_2 Store conversions continue at a rapid clip at our flagship OLS.Switch payment system acquirer-side implementation.  As noted in a previous post, we're approaching a situation where the 'new normal' will be 1,000,000 transactions a day from 5,000+ store locations.  Two weeks ago we were at 4,674 locations.  As of yesterday, we're at 4,801.  And this "typical" Friday now sees us handling 920,522 separate customer interactions including:

  • Credit card (American Express, MasterCard, Discover, Visa, JC Penney, JCB, Diners Club)
  • Offline (signature) debit card
  • Online ("PIN-ed") debit card (go here for my explanation of Credit vs. Debit)
  • Electronic Benefits Transactions ('EBT'), both the 'cash' and 'food stamp' transaction varieties
  • Our client's own branded Stored Value (aka, "Gift") Card, including brand differentiation (i.e., their brand as well as the brands of companies they've acquired and assimilated)
  • Phone cards
  • Third-party Stored Value cards (those ubiquitous "card malls" you've seen sprouting up in retailers all over, thanks to these guys)
  • Private label card
  • Employee Verification (if we validate the employee, the Store System applies a discount on the subsequent transaction)
  • Discount Coupon validation
  • Customer reward card "reverse lookup" by phone number

Coming soon:

  • Check authorization - Developed and certified; this is getting implemented into production this upcoming week...we expect to see another 50,000 or more transactions a day when this transaction type gets rolled out completely.
  • Online "MethCheck" Pseudoephedrine Inquiry - Developed and now in test and acceptance by our client (more on this new initiative in a subsequent post)
  • Healthcare Flexible Spending Accounts ('FSA') - Now under development (more on this new initiative in a subsequent post)

The full working week looked like this...presented in Spanish in homage to jPOS Project Lead Alejandro Revilla (this one, not this one) and his many fans in Latin America:

Lunes - 821,258
Martes - 847,890
Miércoles - 848,590
Jueves - 876,766
Viernes - 920,522

¿Impresionante, verdad?

One other observation is that the average response time for all externally authorized transactions was 928 milliseconds, an important metric because payment systems implementers always strive for the goal of "sub-second response time."  [NOTE:  I'm only measuring approvals; denials are skewed by issuer problems beyond our span of control.]  Two factors bring this number up higher to 1 second than we've seen in the past:

  • The client has rolled out ""PIN promting by BIN" at its stores, so we're weighted more heavily towards online Debit in the transaction mix.  Those transactions go through a gateway (and most times a regional network) before hitting the issuer, meaning it requires three to four hardware PIN translations along the way.
  • One of the Stored Value authorizers had an average response time yesterday about 2x their norm. 
My Photo

Tools

  • Google

    The entire web
    www.andyorrock.com
AddThis Social Bookmark Button

Resources

  • About Me
  • Dave Bergert's blog
    Insightful payment systems thoughts by my OLS colleague, Dave Bergert, CISSP, CISA, CompTIA Security+, and former Visa-certified QSA.
  • Glenbrook Partners' Blog List
    Glenbrook Partners has compiled "a current summary of the latest content from some of our favorite payments and banking blogs based upon their RSS feeds." Alejandro, Dave and I are on the list, as are many other good info sources.
  • jPOS
    Faced with payment systems challenges? Start here to learn more about Alejandro Revilla's jPOS project.
  • Randy San Nicolas' blog
    My OLS colleague Randy San Nicolas writes about his wealth of experience in various Issuer- and Acquirer-side endeavors in his Prepaid Enterprise blog.
  • soliSYSTEMS
    My friend Roque Solis is our go-to guy for RFID, smart cards, chip cards, integrated circuit(s) cards (ICC), HSMs, cryptographic accelerators, DES and public-key cryptography.
  • Specs Online - AMEX
    American Express (Amex) puts all its acquirer specs online for public retrieval.
  • Specs Online - First Data
    First Data Merchant Services (FDMS, aka 'FDR') puts all its acquirer specs online for public retrieval. [NOTE: FDMS' spec repository is accessible only via Internet Explorer; this link will not work with Firefox or other browsers.]
Blog Widget by LinkWithin

Enter your email address:

Delivered by FeedBurner

Blog powered by TypePad

If you're looking here...

  • Your attention to detail is a great asset. Use it wisely.