High-efficiency Data Business Development Tool esProc SPL

1. What is esProc
2. Why you need esProc
3. Application scenario

Managing Endless Report Development

01What is esProc?

What is esProc?

Data Business Development Tool

Designed for structured and semi-structured data handling and computing
Intended for application developers and data analysts
Has built-in light-weight scripting language SPL

SPL: Structured Process Language

What are the counterparts of esProc？

Various development tools for data analytics (and corresponding programming languages)

SQL-based

Java-based

Python-based

What pain points does esProc solve？

Each of those data process technologies (tools and languages) has their defects:

SQL-based

SQL uses hard to write multilayer nested queries to implement complex tasks
The debugging functionalities IDE offers are poor

Java-based

Java/Stream lacks certain class libraries, making task implementation difficult
ORM offers limited technical abilities

Python-based

Confused concepts and poor syntactic consistency
Integration-unfriendly

What values does esProc bring?

Efficiency increased

Increase development efficiency
Boost computing performance

Costs reduced

Reduce HR/development costs
Reduce hardware costs

02Why you need esProc

Why you need esProc？

1 Easy to edit and debug
2 Distinguished syntax and a rich library of functions
3 Diverse data sources support
4 Lightweight and integration-friendly
5 High performance

Easy to edit and debug

Distinguished syntax

Function option syntax

pos@z("abcdeffdef","def")	//@z: backwards searching
pos@c("abcdef","Def")	//@c: Case insensitive
pos@zc("abcdeffdef","Def") //Options working together

Cascaded parameter

//The three-level parameter: first level represented by semicolon, second level represented by comma, and third level represented by colon 
join(Orders:o,SellerId;Employees:e,EId).groups(e.Dept;sum(o.Amount))

Implicit Lambda syntax

//Implicit Lambda syntax, further simplified by symbols [] # ~
stock.sort(tradeDate).group@i(price< price[-1]).max(~.len())

Powerful computing capability

Order-based computation

// Position searching and relative reference of position stock.calc(stock.pselect(price>100),price-price[-1])

Ability to retain grouped subsets

// Further computations on each member subset of group
T("employee.csv").group(DEPT).select(~.len()>10).conj()

Ability to operate multilevel data

//Directly use dot（.）to reference data at a lower level
json(file("orders.json").read())orders.select(order_details.select@1(product.category=="Electronics") && order_details.sum(price*quantity)>200).new(order_id,order_date)

Computation comparison Find the largest count of consecutive rising days

Java

public Map<String, Integer> calculateMaxConsecutiveIncreaseDays(List<StockRecord> stockRecords) {
	Map<String, List<StockRecord>> groupedByCode = stockRecords.stream()
			.collect(Collectors.groupingBy(
					StockRecord::getCode,
					Collectors.collectingAndThen(
							Collectors.toList(),
							list -> {
								list.sort(Comparator.comparing(StockRecord::getDt));
								return list;
							}
					)
			));
	Map<String, Integer> result = new HashMap<>();
	for (Map.Entry<String, List<StockRecord>> entry : groupedByCode.entrySet()) {
		String code = entry.getKey();
		List<StockRecord> records = entry.getValue();
		if (records.isEmpty()) continue;
		Map<Integer, Integer> consecutiveDaysMap = new HashMap<>();
		int cumulativeSum = 0;
		for (int i = 0; i < records.size(); i++) {
			StockRecord current = records.get(i);
			int flag;
			if (i == 0) {
				flag = 1; 
			} else {
				StockRecord prev = records.get(i - 1);
				flag = current.getCl() > prev.getCl() ? 0 : 1;
			}
			cumulativeSum += flag; 
			consecutiveDaysMap.merge(cumulativeSum, 1, Integer::sum);
		}
		int maxDays = consecutiveDaysMap.values().stream()
				.max(Comparator.naturalOrder())
				.orElse(0);

		result.put(code, maxDays);
	}
	return result;
}

SQL

SELECT CODE, MAX(con_rise) AS max_increase_days
FROM (
    SELECT CODE, COUNT(*) AS con_rise
    FROM (
        SELECT CODE, DT,  SUM(updown_flag) OVER (PARTITION BY CODE ORDER BY CODE, DT) AS no_up_days
        FROM (
            SELECT CODE, DT, 
                    CASE WHEN CL > LAG(CL) OVER (PARTITION BY CODE ORDER BY CODE, DT)  THEN 0
                    ELSE 1 END AS updown_flag
            FROM stock
        )
    )
    GROUP BY CODE, no_up_days
)
GROUP BY CODE

Python

Simport pandas as pd
stock_file = "StockRecords.txt"
stock_info = pd.read_csv(stock_file,sep="\t")
stock_info.sort_values(by=['CODE','DT'],inplace=True)
stock_group = stock_info.groupby(by='CODE')
stock_info['label'] = stock_info.groupby('CODE')['CL'].diff().fillna(0).le(0).astype(int).cumsum()
max_increase_days = {}
for code, group in stock_info.groupby('CODE'):
    max_increase_days[code] = group.groupby('label').size().max() – 1
max_rise_df = pd.DataFrame(list(max_increase_days.items()), columns=['CODE', 'max_increase_days'])

SPL

stock.sort(DT).group(CODE;~.group@i(CL< CL[-1]).max(~.len()):mids)

esProc SPL needs only one line of code！

Rich function library

Basic functions

Date
String
…

Math & statistical functions

Vector
Regression
Classification algorithm
Statistical & analytic functions
…

Set operations

Search
Group
Association
…

File & network processing

CSV/Excel
XML/JSON
RESTful interface
…

Database handling

Query
Update
Transaction
…

Big data processing

Cursor
Parallel processing
…

Chart plotting

…

Natural mixed-source computations

Direct access to diverse data sources to perform mixed-source computations via conveniently extensible native interface

Some of the data sources esProc SPL supports

Mixed-source computation example

	A
1	=connect("mysql")
2	=A1.query@x("SELECT o.order_id, o.user_id, o.order_date, oi.product_id, oi.quantity, oi.price FROM orders o JOIN order_items oi ON o.order_id = oi.order_id WHERE o.order_date >= CURDATE() - INTERVAL 1 MONTH")
3	=mongo_open("mongodb://192.168.1.15:27017/raqdb")
4	=mongo_shell@d(A3, "{ 'find': 'products', 'filter': { 'category': { '$in': ['Tablets', 'Wearables', 'Audio'] } }}” )
5	=A2.join@i(product_id,A4:product_id,name,brand,category,attributes)
6	=A5.groups(category;sum(price*quantity):amount)

Mixed computation between MySQL & MongoDB

Lightweight

JDK1.8 or higher version of JVM
The core deployment package is less than 15M and easy to distribute
Can run on any OS, including VM and Container, and even Android
Low hardware requirement; can run on a PC

Application integration architecture

Integration-friendly

Working well with various Java frameworks & reporting tools

Called via standard JDBC

High performance

Storage

esProc offers two high-performance storage formats

Bin file

Simple format, no need to define structure

Composite file

Mixed row-based and columnar storage, support index, and need to define structure beforehand

IO performance is more than 15 times higher than conventional databases

Algorithms

In-memory search

Location by sequence number
Position-based index
HASH index
Location by multilevel sequence number
In-memory table index

External storage search

Sorting index
Value-attached index
Index preloading
Batch searching
Full-text search

Traversal technology

Delayed cursor
New interpretation of aggregation
Ordered cursor
Multi-purpose traversal
Pre-filtered traversal

High-efficiency association

Foreign key pointerization
Foreign key numberization
Ordered merge
Attached table
Unilateral partition join

Algorithm example - TopN

SQL


//To get TopN from entire set，database engine will perform optimization to avoid big sorting
SELECT TOP 10 * FROM Orders ORDER BY Amount DESC

//For an intra-group TopN，database optimization is disabled by the nested query and can only turn to big sorting
SELECT * FROM (
      SELECT *, ROW_NUMBER() OVER (PARTITION BY Area ORDER BY Amount DESC) rn 
      FROM Orders ) 
WHERE rn<=10

SPL

	A	B
1	=file("Orders.ctx").open().cursor()
2	=A1.groups(;top(10;-Amount))	Get TopN from entire set
3	=A1.groups(Area;top(10;-Amount))	GET TopN from each group

SPL treats the TopN problem as an aggregate operation that returns a set，which gets rid of the big sorting; entire set topN and intra-group TopN have similar code and both have high performance

03Application scenario

Data preparation layer for report & query

SPL agile computing increases development efficiency by avoiding complex SQL and stored procedures
Computing capability independent of databases; conveniently migratable between databases
Interpreted execution; natural hot-swap; loosely coupled report module and application
Handle endless report development needs at low cost

Java data logic/microservice implementation

Purely Java-based; can be packaged with the main application to enjoy advantages of Java’s mature framework
Agile development can replace Stream/Kotlin/ORM
Computing capability independent of databases; conveniently migratable between databases
Interpreted execution, hot-swap, low coupling

Replacing stored procedures

Purely Java-based; can be packaged with the main application to enjoy advantages of Java’s mature framework and avoid disadvantages of stored procedures
Powerful procedural computation, easy to debug, and high development efficiency
Perform computations independent of databases; intrinsically migratable between databases
Do not need to compile privileges of stored procedure; low application coupling; increased security and reliance

Slimming down databases/Eliminating intermediate tables

Move non-critical intermediate data from database to files to reduce database storage burden
Tree-structure directories are easy to manage and creates low coupling
Perform computations outside the database to reduce database’s computing burden
File access has higher performance, which greatly increases computing performance

Mixed-source computations/Realtime full data analytics

Support of a rich collection of data sources: RDB, NoSQL, File, HTTP,...；and multilevel data such as JSON
Direct access and computation without database loading, which brings realtimeness
Direct mixed computations between heterogeneous databases (between production database & analytic database) to achieve realtime data analytics
Computing capability independent of data sources; intrinsically migratable

Embedded/Edge computing engine

The small-size, fully embedded app can be used for edge computing engine
All-embracing functionalities, including math library, which gets rid of extra components for most computing tasks
Simple file storage without the need of databases
Can connect to remote large-scale data sources and storage devices

Preparation for outside-app data cleansing

A rich collection of data sources accessed with a unified style, making it convenient to access various data sources where SQL is absent
Powerful description ability, making it more concise and convenient than Python to express complex problems
Parallel computation, making it faster and convenient than Python in big data handling
High integrability, which transforms the computation to an intra-app one as needed

Exploration and analytics by data scientists

Powerful description ability, making it more concise and convenient than Python to express complex problems
Higher interactivity than SQL and Python, and more convenient to debug
File storage creates portable and independent data, which can be analyzed anywhere without databases
Parallel computation, making it faster and convenient than Python in big data handling

Desktop data analytics

Hot computation of cold data

Historical cold data required to be loaded to the database takes up large amounts of space, causing complex operation & maintenance and high costs yet with a low frequency of use
Ad-hoc loading is inefficiency as it takes much more time than the computation time

SPL directly computes files without loading them to some kind of “database”, making data not “cold” any more and creating hot computations at low storage costs

Download

Community

Managing Endless Report Development