01What is esProc?

What is esProc?

Data Business Development Tool

  • Designed for structured and semi-structured data handling and computing
  • Intended for application developers and data analysts
  • Has built-in light-weight scripting language SPL
SPL: Structured Process Language

What are the counterparts of esProc?

Various development tools for data analytics (and corresponding programming languages)

SQL-based
Java-based
Python-based

What pain points does esProc solve?

Each of those data process technologies (tools and languages) has their defects:

SQL-based
  • SQL uses hard to write multilayer nested queries to implement complex tasks
  • The debugging functionalities IDE offers are poor
Java-based
  • Java/Stream lacks certain class libraries, making task implementation difficult
  • ORM offers limited technical abilities
Python-based
  • Confused concepts and poor syntactic consistency
  • Integration-unfriendly

What values does esProc bring?

Efficiency increased

  • Increase development efficiency
  • Boost computing performance

Costs reduced

  • Reduce HR/development costs
  • Reduce hardware costs

02Why you need esProc

Why you need esProc?

  • 1 Easy to edit and debug
  • 2 Distinguished syntax and a rich library of functions
  • 3 Diverse data sources support
  • 4 Lightweight and integration-friendly
  • 5 High performance

Easy to edit and debug

Distinguished syntax

Function option syntax

pos@z("abcdeffdef","def") //@z: backwards searching pos@c("abcdef","Def") //@c: Case insensitive pos@zc("abcdeffdef","Def") //Options working together

Cascaded parameter

//The three-level parameter: first level represented by semicolon, second level represented by comma, and third level represented by colon join(Orders:o,SellerId;Employees:e,EId).groups(e.Dept;sum(o.Amount))

Implicit Lambda syntax

//Implicit Lambda syntax, further simplified by symbols [] # ~ stock.sort(tradeDate).group@i(price< price[-1]).max(~.len())

Powerful computing capability

Order-based computation

// Position searching and relative reference of position stock.calc(stock.pselect(price>100),price-price[-1])

Ability to retain grouped subsets

// Further computations on each member subset of group T("employee.csv").group(DEPT).select(~.len()>10).conj()

Ability to operate multilevel data

//Directly use dot(.)to reference data at a lower level json(file("orders.json").read())orders.select(order_details.select@1(product.category=="Electronics") && order_details.sum(price*quantity)>200).new(order_id,order_date)

Computation comparison Find the largest count of consecutive rising days

Java

public Map<String, Integer> calculateMaxConsecutiveIncreaseDays(List<StockRecord> stockRecords) {
	Map<String, List<StockRecord>> groupedByCode = stockRecords.stream()
			.collect(Collectors.groupingBy(
					StockRecord::getCode,
					Collectors.collectingAndThen(
							Collectors.toList(),
							list -> {
								list.sort(Comparator.comparing(StockRecord::getDt));
								return list;
							}
					)
			));
	Map<String, Integer> result = new HashMap<>();
	for (Map.Entry<String, List<StockRecord>> entry : groupedByCode.entrySet()) {
		String code = entry.getKey();
		List<StockRecord> records = entry.getValue();
		if (records.isEmpty()) continue;
		Map<Integer, Integer> consecutiveDaysMap = new HashMap<>();
		int cumulativeSum = 0;
		for (int i = 0; i < records.size(); i++) {
			StockRecord current = records.get(i);
			int flag;
			if (i == 0) {
				flag = 1; 
			} else {
				StockRecord prev = records.get(i - 1);
				flag = current.getCl() > prev.getCl() ? 0 : 1;
			}
			cumulativeSum += flag; 
			consecutiveDaysMap.merge(cumulativeSum, 1, Integer::sum);
		}
		int maxDays = consecutiveDaysMap.values().stream()
				.max(Comparator.naturalOrder())
				.orElse(0);

		result.put(code, maxDays);
	}
	return result;
}

SQL

SELECT CODE, MAX(con_rise) AS max_increase_days
FROM (
    SELECT CODE, COUNT(*) AS con_rise
    FROM (
        SELECT CODE, DT,  SUM(updown_flag) OVER (PARTITION BY CODE ORDER BY CODE, DT) AS no_up_days
        FROM (
            SELECT CODE, DT, 
                    CASE WHEN CL > LAG(CL) OVER (PARTITION BY CODE ORDER BY CODE, DT)  THEN 0
                    ELSE 1 END AS updown_flag
            FROM stock
        )
    )
    GROUP BY CODE, no_up_days
)
GROUP BY CODE

Python

Simport pandas as pd
stock_file = "StockRecords.txt"
stock_info = pd.read_csv(stock_file,sep="\t")
stock_info.sort_values(by=['CODE','DT'],inplace=True)
stock_group = stock_info.groupby(by='CODE')
stock_info['label'] = stock_info.groupby('CODE')['CL'].diff().fillna(0).le(0).astype(int).cumsum()
max_increase_days = {}
for code, group in stock_info.groupby('CODE'):
    max_increase_days[code] = group.groupby('label').size().max() – 1
max_rise_df = pd.DataFrame(list(max_increase_days.items()), columns=['CODE', 'max_increase_days'])

SPL

stock.sort(DT).group(CODE;~.group@i(CL< CL[-1]).max(~.len()):mids)
esProc SPL needs only one line of code!

Rich function library

Basic functions

  • Date
  • String

Math & statistical functions

  • Vector
  • Regression
  • Classification algorithm
  • Statistical & analytic functions

Set operations

  • Search
  • Group
  • Association

File & network processing

  • CSV/Excel
  • XML/JSON
  • RESTful interface

Database handling

  • Query
  • Update
  • Transaction

Big data processing

  • Cursor
  • Parallel processing

Chart plotting

Natural mixed-source computations

Direct access to diverse data sources to perform mixed-source computations via conveniently extensible native interface

Some of the data sources esProc SPL supports

Mixed-source computation example

A
1=connect("mysql")
2=A1.query@x("SELECT o.order_id, o.user_id, o.order_date, oi.product_id, oi.quantity, oi.price
FROM orders o JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.order_date >= CURDATE() - INTERVAL 1 MONTH")
3=mongo_open("mongodb://192.168.1.15:27017/raqdb")
4=mongo_shell@d(A3,
"{ 'find': 'products',
'filter': {
'category': { '$in': ['Tablets', 'Wearables', 'Audio'] }
}}”
)
5=A2.join@i(product_id,A4:product_id,name,brand,category,attributes)
6=A5.groups(category;sum(price*quantity):amount)

Mixed computation between MySQL & MongoDB

Lightweight

  • JDK1.8 or higher version of JVM
  • The core deployment package is less than 15M and easy to distribute
  • Can run on any OS, including VM and Container, and even Android
  • Low hardware requirement; can run on a PC

Application integration architecture

Integration-friendly

Working well with various Java frameworks & reporting tools

Called via standard JDBC

High performance

Storage

esProc offers two high-performance storage formats

Bin file

Simple format, no need to define structure

Composite file

Mixed row-based and columnar storage, support index, and need to define structure beforehand

IO performance is more than 15 times higher than conventional databases

Algorithms

In-memory search
  • Location by sequence number
  • Position-based index
  • HASH index
  • Location by multilevel sequence number
  • In-memory table index
External storage search
  • Sorting index
  • Value-attached index
  • Index preloading
  • Batch searching
  • Full-text search
Traversal technology
  • Delayed cursor
  • New interpretation of aggregation
  • Ordered cursor
  • Multi-purpose traversal
  • Pre-filtered traversal
High-efficiency association
  • Foreign key pointerization
  • Foreign key numberization
  • Ordered merge
  • Attached table
  • Unilateral partition join

Algorithm example - TopN

SQL
//To get TopN from entire set,database engine will perform optimization to avoid big sorting SELECT TOP 10 * FROM Orders ORDER BY Amount DESC //For an intra-group TopN,database optimization is disabled by the nested query and can only turn to big sorting SELECT * FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY Area ORDER BY Amount DESC) rn FROM Orders ) WHERE rn<=10
SPL
AB
1=file("Orders.ctx").open().cursor()
2=A1.groups(;top(10;-Amount)) Get TopN from entire set
3=A1.groups(Area;top(10;-Amount)) GET TopN from each group
SPL treats the TopN problem as an aggregate operation that returns a set,which gets rid of the big sorting; entire set topN and intra-group TopN have similar code and both have high performance

03Application scenario

Data preparation layer for report & query

  • SPL agile computing increases development efficiency by avoiding complex SQL and stored procedures
  • Computing capability independent of databases; conveniently migratable between databases
  • Interpreted execution; natural hot-swap; loosely coupled report module and application
  • Handle endless report development needs at low cost

Java data logic/microservice implementation

  • Purely Java-based; can be packaged with the main application to enjoy advantages of Java’s mature framework
  • Agile development can replace Stream/Kotlin/ORM
  • Computing capability independent of databases; conveniently migratable between databases
  • Interpreted execution, hot-swap, low coupling

Replacing stored procedures

  • Purely Java-based; can be packaged with the main application to enjoy advantages of Java’s mature framework and avoid disadvantages of stored procedures
  • Powerful procedural computation, easy to debug, and high development efficiency
  • Perform computations independent of databases; intrinsically migratable between databases
  • Do not need to compile privileges of stored procedure; low application coupling; increased security and reliance

Slimming down databases/Eliminating intermediate tables

  • Move non-critical intermediate data from database to files to reduce database storage burden
  • Tree-structure directories are easy to manage and creates low coupling
  • Perform computations outside the database to reduce database’s computing burden
  • File access has higher performance, which greatly increases computing performance

Mixed-source computations/Realtime full data analytics

  • Support of a rich collection of data sources: RDB, NoSQL, File, HTTP,...;and multilevel data such as JSON
  • Direct access and computation without database loading, which brings realtimeness
  • Direct mixed computations between heterogeneous databases (between production database & analytic database) to achieve realtime data analytics
  • Computing capability independent of data sources; intrinsically migratable

Embedded/Edge computing engine

  • The small-size, fully embedded app can be used for edge computing engine
  • All-embracing functionalities, including math library, which gets rid of extra components for most computing tasks
  • Simple file storage without the need of databases
  • Can connect to remote large-scale data sources and storage devices

Preparation for outside-app data cleansing

  • A rich collection of data sources accessed with a unified style, making it convenient to access various data sources where SQL is absent
  • Powerful description ability, making it more concise and convenient than Python to express complex problems
  • Parallel computation, making it faster and convenient than Python in big data handling
  • High integrability, which transforms the computation to an intra-app one as needed

Exploration and analytics by data scientists

  • Powerful description ability, making it more concise and convenient than Python to express complex problems
  • Higher interactivity than SQL and Python, and more convenient to debug
  • File storage creates portable and independent data, which can be analyzed anywhere without databases
  • Parallel computation, making it faster and convenient than Python in big data handling

Desktop data analytics

Hot computation of cold data

  • Historical cold data required to be loaded to the database takes up large amounts of space, causing complex operation & maintenance and high costs yet with a low frequency of use
  • Ad-hoc loading is inefficiency as it takes much more time than the computation time
  • SPL directly computes files without loading them to some kind of “database”, making data not “cold” any more and creating hot computations at low storage costs