

Give the right man the right tool and he will work true miracles. Now, give the wrong tool to the wrong guy, and disaster is guaranteed! I’ve been insisting on this for some time now, and I believe it’s worth stressing once again.
Databases are growing fast. And no, I’m not talking about big data. OLTP databases are growing, and with that, the amount of information to be scrutinized in search of knowledge increases every day.
You can’t win this battle with a simple batch of SELECT statements. No, you can’t!
As our databases grow, their complexity in terms of maintenance and optimization also grows. It becomes harder and slower to extract statistical and managerial results from them.
When tools are lacking, there must be intelligence and creativity in abundance.
Big Data is not the answer to everything. Believe me! Some things should stay in your good old RDBMS (whatever it may be).
That said, enough small talk — let’s get to the point: Automated Processes. The good old batch! Uncle here started working with IT in 1981. A precocious one.
But real IT revealed itself to me around 1984. By 1986, I was already dealing with serious problems. Back then, we had to be creative because our powerful “minis” had no more than 256KB (that’s right!) of memory and hard drives with their mighty 5MB (yes, really!). Everything at that time was based on OFFLINE or BATCH processing.
The “trend” came back about eight years ago, and now, more than ever, it’s necessary. It returned partly due to the increase in database sizes.
But what the heck is that? It means admitting that we don’t have the computational power to process everything we need at the speed of light.

I’ve seen queries on databases with 100, 200, 300, and even 500 million rows with absolutely no need for that. No intelligence, or to be clearer: no willingness to write a bit smarter code.
One point I intend to cover in another article is the real need to keep so many rows in production, in the hot database.
Some of the main queries I’ve seen on really large tables include:
– Accumulated sales per period by: customer, product, salesperson, business unit, region, macro and micro region, category, subcategory, etc.;

– Customer relationships integrating: proposals, sales, products, returns, services, contacts, etc.;
– Credit and risk analysis: payment capacity, purchases reported by partners, credit service records, etc.;
I could list countless others, but I believe these three different examples give a good sense of what I want to propose.
Trying to extract this information from the hot database by cramming the whole problem into a single SQL statement is not smart. You spend an enormous amount of time building that SQL, with so many INNER and OUTER JOINs and subqueries that it drives your RDBMS crazy, and in the end the user is left disappointed with the performance.
I’m avoiding naming any RDBMS because this applies to all of them. Always remember: Divide and Conquer — meaning break a problematic SQL into multiple performant ones.
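To make "Divide and Conquer" concrete, here is a minimal sketch (table and column names are hypothetical): instead of one monster statement, filter the huge table down first into a temporary table, then join the already-reduced set against the lookup tables.

```sql
-- Stage 1: cut the enormous sales table down to only the rows we need.
CREATE TEMPORARY TABLE tmp_recent_sales AS
SELECT sale_id, customer_id, total_amount
FROM sales
WHERE sale_date >= '2024-01-01';

-- Stage 2: join only the reduced set; each statement stays simple and fast.
SELECT c.region, SUM(t.total_amount) AS revenue
FROM tmp_recent_sales t
JOIN customers c ON c.customer_id = t.customer_id
GROUP BY c.region;
```

Each piece is easy for the optimizer to handle, easy for you to test in isolation, and easy to tune with the right indexes.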
Alright, but here’s the point: You should not go to the hot database for statistical or managerial information involving a large volume of data! Doing so can cause lockups, downtime, slowness, and worst of all: unhappy customers.
Therefore, to create queries from large databases, we need to use more guts and be more intelligent and creative. That’s where batches, those automated processes, come in.

We need to create denormalized tables to store a multitude of accumulated data in an intelligent and pragmatic way. For example, if we need to track over time how much each customer bought of each product line, we create a table that stores exactly that, plus customer names and product names, basically all the necessary information so that a single SELECT without any JOIN solves the query.
A batch process that runs periodically (every hour, daily, weekly, depending on your needs) reads the hot database, with all its normalized tables, and writes the aggregated results into this single denormalized table. This technique is also known as a summary table. It is far more rational and saves tons of data processing. Imagine that getting the sales volume of a certain product category over the last five years requires reading 10 tables, with the main sales table alone holding 500 million rows. If you run this query eight times a day, you're literally in trouble!
Now imagine implementing an automated batch process that collects all information for all products, categories, customers, periods—every possible combination—and writes this data daily into a summary table. Instead of reading those 500 million rows, now you only have to read a few hundred. Makes sense, right?
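A minimal sketch of the idea in MySQL-flavored SQL (all table and column names here are hypothetical): the summary table is deliberately denormalized, the batch aggregates only the latest movement into it, and the user-facing query needs no JOIN at all.

```sql
-- Hypothetical summary table: one row per month x customer x product line.
CREATE TABLE sales_summary (
    period        DATE          NOT NULL,  -- first day of the month
    customer_id   INT           NOT NULL,
    customer_name VARCHAR(100)  NOT NULL,  -- denormalized on purpose
    product_line  VARCHAR(50)   NOT NULL,
    total_amount  DECIMAL(15,2) NOT NULL,
    total_qty     INT           NOT NULL,
    PRIMARY KEY (period, customer_id, product_line)
);

-- Batch step: aggregate yesterday's movement from the normalized hot tables
-- and accumulate it into the summary.
INSERT INTO sales_summary
    (period, customer_id, customer_name, product_line, total_amount, total_qty)
SELECT DATE_FORMAT(s.sale_date, '%Y-%m-01'),
       c.customer_id, c.name, p.product_line,
       SUM(i.amount), SUM(i.quantity)
FROM sales s
JOIN sale_items i ON i.sale_id     = s.sale_id
JOIN products p   ON p.product_id  = i.product_id
JOIN customers c  ON c.customer_id = s.customer_id
WHERE s.sale_date >= CURDATE() - INTERVAL 1 DAY
GROUP BY 1, 2, 3, 4
ON DUPLICATE KEY UPDATE
    total_amount = total_amount + VALUES(total_amount),
    total_qty    = total_qty    + VALUES(total_qty);

-- The managerial query then touches one small table, with zero JOINs:
SELECT customer_name, SUM(total_amount)
FROM sales_summary
WHERE product_line = 'Hardware'
  AND period >= '2020-01-01'
GROUP BY customer_name;
```

The expensive multi-table aggregation runs once per batch cycle, off-peak, instead of once per user request.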
Most database systems offer options to create automated processes. In MySQL, we have EVENTS; in Oracle, we have JOBs. Again, I want to cover this in another article.
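As a taste of the MySQL side, here is a hedged sketch of an EVENT that fires the refresh nightly. The procedure name refresh_sales_summary() is a placeholder for whatever routine actually rebuilds your summary table.

```sql
-- The scheduler may be off by default; enable it first.
SET GLOBAL event_scheduler = ON;

-- Hypothetical nightly event calling a placeholder refresh procedure.
CREATE EVENT ev_refresh_sales_summary
ON SCHEDULE EVERY 1 DAY
STARTS CURRENT_TIMESTAMP + INTERVAL 1 DAY
DO
  CALL refresh_sales_summary();
```

Oracle's DBMS_SCHEDULER jobs follow the same pattern: a schedule plus a stored routine to run.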
But there are other ways to create and manage your automated tasks, such as:
A/ Operating System level
– Linux/Unix: Using CRON (crontab), you can schedule the OS to run SQL scripts, PHP, Perl, Java, Python, etc. The big advantage of CRON over embedded RDBMS batches is the freedom to choose your programming language, plus all the flexibility and features these languages provide.
– Windows: See? I’m not so strict—I wrote Windows properly. On Windows, you can use Task Scheduler, similar to CRON.
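On the Linux/Unix side, the whole thing can be one crontab line. A sketch under stated assumptions: the SQL script path, database name, credentials file, and log path below are all placeholders.

```shell
# m  h  dom mon dow  command
# Every night at 02:15, run the refresh script and append output to a log.
15 2 * * *  mysql --defaults-extra-file=/home/batch/.my.cnf mydb < /opt/batch/refresh_summary.sql >> /var/log/batch/refresh.log 2>&1
```

Keeping credentials in a file readable only by the batch user, instead of on the command line, avoids exposing the password in the process list.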
B/ Batch Tools
– SpringBatch (http://static.springsource.org/spring-batch/): a framework for processing large amounts of data in enterprise and mission-critical environments. It is not comparable to CRON: SpringBatch is more of an orchestrator that gives you total control over the batch jobs running under your task server. It is written in Java and supports code reuse, SOA, message queues, and web platforms. Definitely worth checking out. SpringBatch is part of the famous Spring Java development framework, maintained by SpringSource.
– Quartz (http://www.quartz-scheduler.org): Unlike SpringBatch, which is part of a development platform, Quartz is the product itself — specialized and focused. In my tests, I was impressed by its many options and configurations. This tool is truly for mission-critical, high-volume data environments. It’s free, supports clustering, load balancing, and failover. Like SpringBatch, it can run any Java class.
There are many others, but in my daily work I use the good old OS Crontab, plus MySQL’s Event scheduler and Oracle’s Job scheduler. Thanks to the great Lord Darth Vader, I never had to use Windows Task Scheduler (had to sneak that in, right?). If anyone thought about PHP Scheduler, they didn’t pay attention to this article 🙂
Currently, I use SpringBatch and Quartz, the latter having earned my respect for its simplicity and features.
I can’t think of a company or organization that doesn’t have some tasks that can or should be processed in batch. And I have no doubt the benefits of this practice are many: better performance, greater process control, security, and most importantly: happy customers!