Modern batch processing: a thing of the past or essential discipline?

Automation of modern-day services.


Johannesburg, 24 Feb 2017

What is batch?

Batch, as the word implies, typically means processing a collection, and it has evolved to also mean performing multiple steps in sequence. Scheduling adds the requirement that some event must occur before processing is initiated. That event may be the arrival of a specific time, the completion of another batch, the creation of a file, the execution of a transaction, the publication of a message on a queue or something similar, says BMC Software.
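To make the idea of event-triggered batch concrete, here is a minimal sketch in Python. It is not any scheduler product's API: the `Job` class, the `run_ready_jobs` function and the event names (such as `file.arrived`) are all hypothetical, chosen only to show a job waiting on events and one batch triggering another.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Job:
    name: str
    waits_for: Set[str]          # events that must ALL have occurred first
    action: Callable[[], None]   # the work the job performs
    done: bool = False


def run_ready_jobs(jobs: List[Job], occurred: Set[str]) -> List[str]:
    """Run every job whose trigger events have occurred; return the names run.

    A finished job publishes a "<name>.done" event (a naming assumption),
    so completing one batch can trigger the next.
    """
    ran: List[str] = []
    progress = True
    while progress:
        progress = False
        for job in jobs:
            if not job.done and job.waits_for <= occurred:
                job.action()
                job.done = True
                occurred.add(f"{job.name}.done")
                ran.append(job.name)
                progress = True
    return ran


log: List[str] = []
jobs = [
    Job("bill", {"load.done"}, lambda: log.append("bill")),
    Job("load", {"file.arrived", "time.02:00"}, lambda: log.append("load")),
]

# The file has arrived but the scheduled time has not: nothing runs.
first = run_ready_jobs(jobs, {"file.arrived"})
# Both events have occurred: "load" runs, and its completion releases "bill".
second = run_ready_jobs(jobs, {"file.arrived", "time.02:00"})
```

A real scheduler would also handle failures, retries and calendars; the sketch only shows the triggering logic.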

Traditional batch work, such as inventory processing, warehouse management, payroll and customer billing, is still a major activity in almost every business computing environment. The big question for batch practitioners is whether new business services can or should use a batch approach in their implementations.

Applications operate in only a few modes.

One is interactive, or real-time. In this mode, the application communicates with another application or a human. Real-time means the process must produce a result immediately (or quickly enough to seem immediate) to satisfy the waiting second party. Interactive, or conversational, means the same thing: the application is conversing with that second party.

Another, relatively new mode is streaming. Data arrives constantly, and the expectation is that something receives the data and takes some action.

Everything else is unattended, asynchronous, background or "batch." This is arguably the mode where computers are most effective because they can operate at "machine speed" without having to wait for slow humans, slow devices delivering data or slow networks to enable communication with another application or computer.

A common misconception is that batch equals long-running, overnight, low priority or deferred (as in "I don't need to do that now, so I'll run it in batch later").

Although some batch processing may have those characteristics, there are also micro-batch and "batch transactions" that are high-speed, iterative and efficient.

What is "modern batch processing"?

Since the inception of commercial data processing, batch has accounted for a significant share of all useful work done by computers.

Many traditional users still think of batch as that "old" overnight process. This thinking can be traced to the modern application development lexicon, which favours the notions and terminology of "real-timeness" because of the overwhelming attention paid to the real-time nature of everything we do, especially the responsiveness required from modern digital transformation applications. I believe it is time to define a new "Modern Batch" and give the term meaning in the DevOps lexicon. If you are familiar with the Wizard of Oz, the Scarecrow, Lion and Tin Man just needed some external validation to confirm to the world what was already true but misunderstood. Likewise, batch "just" needs the right term to reveal its value to the world and help it claim its rightful place as not only relevant, but an essential discipline in a modern architecture.

Modern batch: to batch or not to batch? There's no question!

Let's consider an interactive application like online purchasing of just about anything. The consumer expects to see a catalogue of available goods, fill up a cart and check out. Below is a step-by-step review of the process:

* Purchase: Let's say one of the items purchased reduces inventory below a certain threshold. Let's also assume the goods have to be packaged and then shipped by a pre-determined courier or delivery company. Or perhaps the inventory is held by another supplier and that other company will package and ship the goods.

* Inventory replenishment: Once the order has been confirmed, do you think that inventory is replenished immediately? Perhaps this purchase was made at 9am and hundreds or thousands of additional items will be purchased throughout the day. It seems reasonable to wait until some point, either a specific time or when some number of items has been purchased, before placing an order to replenish the inventory. And once the replenishment is initiated, it's very likely that the process consists of several steps occurring over days (or longer), which require coordination, monitoring and visibility into the progress of the entire process.

* Packaging for shipment: What about the packaging? Does it make sense that the ordered items are picked immediately and boxed for shipment as soon as the transaction is completed or is it reasonable to wait again until a specific time or a specific number of certain goods can be fetched?

* Shipping: Finally, how about the shipping? Should we schedule a pickup for each order or only at the end of the day? And if we do request a shipment as soon as an order is placed, is it reasonable for a truck to come and pick up the single order and immediately begin the delivery process?
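The pattern that recurs in the steps above, waiting until a specific time or until a certain number of items has accumulated, can be sketched in a few lines of Python. The `OrderBatcher` class, its `max_orders` threshold and its `cutoff_hour` are hypothetical names and values used purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OrderBatcher:
    max_orders: int    # hypothetical size trigger
    cutoff_hour: int   # hypothetical daily cut-off, 24-hour clock
    pending: List[str] = field(default_factory=list)

    def add(self, order_id: str, hour: int) -> Optional[List[str]]:
        """Queue an order; return a batch to release if either trigger fires."""
        self.pending.append(order_id)
        if len(self.pending) >= self.max_orders or hour >= self.cutoff_hour:
            batch, self.pending = self.pending, []
            return batch
        return None


batcher = OrderBatcher(max_orders=3, cutoff_hour=17)
r1 = batcher.add("A", hour=9)    # held back: neither trigger has fired
r2 = batcher.add("B", hour=10)   # held back
r3 = batcher.add("C", hour=11)   # size trigger: A, B and C are released together
r4 = batcher.add("D", hour=17)   # time trigger: D is released at the cut-off
```

The online purchase itself stays interactive; only the downstream picking, packing and pickup work is grouped.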

Big Data - movement, formatting and storage

Big Data manipulation is another classic use for batch. A collection of records, which may be known as a file or a "data set", fits the definition of a batch extremely well. In fact, almost any application that processes bulk data is really a batch application.

As traditional data management is being overhauled and disrupted with technologies like Hadoop and Big Data, batch continues to be among the most common use cases in this sector for both traditional and modern applications.

Moving data is inherently a batch process because it typically involves files, and a file is a batch of data. Previously, I mentioned streaming, a relatively new term that is applied very broadly. It is often assumed that because data is flowing in real time, each individual record must correspondingly be processed in real time. This is frequently not the case. In fact, one of the most popular new technologies in the Big Data world, Spark Streaming, is a "micro-batch" implementation. Even when "data in motion" is being processed, it is also almost always stored for subsequent processing, usually in some batch mode.
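As a rough illustration of micro-batching, the sketch below cuts a stream into small groups and processes each group as a unit. It is plain Python, not Spark code, and it is simplified: Spark Streaming forms its batches by time interval, whereas the hypothetical `micro_batches` helper here groups by record count to keep the example short.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(stream: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of up to `size` records from a stream."""
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Seven records arrive one by one but are processed three at a time.
batches = list(micro_batches(range(1, 8), size=3))
totals = [sum(batch) for batch in batches]
```

Each record still flows in continuously; the work on it is simply done a small batch at a time.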

Extract, transform, load (ETL)

Data occurs in so many formats that it's almost always necessary to re-format, edit, normalise or otherwise process it to make it useful. This collection of actions is described as ETL. An amusing play on words is a newer variant called ELT (extract, load, transform), used by big data aficionados to highlight the non-deterministic approach of big data, which defers the transform phase to the last step so that potential insights are not biased by format. It is still largely a batch process.
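A toy version of the three phases, using only the Python standard library, might look like the following. The sample data, the `payments` table and the function names are invented for illustration; a real pipeline would read from files or source systems rather than an inline string.

```python
import csv
import io
import sqlite3

# Hypothetical raw input with inconsistent spacing and capitalisation.
RAW = "name,amount\n alice ,10.5\nBOB,3\n"


def extract(text):
    """Extract: parse the raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalise names and convert amounts to numbers."""
    return [(row["name"].strip().title(), float(row["amount"])) for row in rows]


def load(conn, records):
    """Load: write the cleaned records into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW)))
rows = conn.execute("SELECT name, amount FROM payments ORDER BY name").fetchall()
```

In an ELT arrangement, the raw rows would be loaded first and the `transform` step would run later, inside the target store. Either way, the whole collection is handled as a unit.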

Storage

All data that has any potential value beyond its immediate, real-time usage must be stored somewhere and managed somehow. This turns out to be almost all data, and one may be hard-pressed to find any exceptions. Even intentionally ephemeral social media conversations that are designed to "self-destruct" within some short amount of time must still be managed for that duration. The destruction itself, once the desired time has elapsed, is a data management function: it is simply a specific use case of traditional records and retention management, which demands destruction upon expiration.
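Destruction upon expiration is itself a natural batch job: sweep the stored records periodically and remove those past their retention period. The sketch below assumes each record is a hypothetical `(id, created_at)` pair and that the retention period is supplied by the caller.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def purge_expired(
    records: List[Tuple[str, datetime]],
    now: datetime,
    retention: timedelta,
) -> Tuple[List[str], List[str]]:
    """Split record ids into (kept, destroyed) based on their age."""
    kept: List[str] = []
    destroyed: List[str] = []
    for record_id, created_at in records:
        if now - created_at >= retention:
            destroyed.append(record_id)
        else:
            kept.append(record_id)
    return kept, destroyed


now = datetime(2017, 2, 24, 12, 0)
messages = [
    ("m1", now - timedelta(hours=30)),   # older than the retention period
    ("m2", now - timedelta(hours=2)),    # still within it
]
kept, destroyed = purge_expired(messages, now, retention=timedelta(hours=24))
```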
