Segmentation and batch processing with legacy data systems

In healthcare environments, old data systems are often relied upon because they are heavily integrated into patient pathways. When these systems have to integrate with newer systems, they are often referred to as legacy systems. Legacy systems were designed for a specific purpose and scope, and with ever-growing data generation and complexity they are not fit for the sudden increase in data demands. Despite this increase in demand, they remain crucial and heavily integrated into current systems and processes, and they still need to be used as part of clinical trials and other supporting research (as in this case).

In the field of data engineering we are often faced with such problems. In this work, to deal with the problem of using a legacy system under much larger data demands, an application was developed to interface with legacy data processing systems. This application performs segmentation and batch processing for legacy data systems.

The specific problem which this application solves is explained below.

Problem

The problem was a heavy reliance on legacy code bases for data merging and transformation. These systems were designed for, and worked well with, the amounts of data available at the time of development, but they do not scale to the larger amounts of data we have today. There was no short-term alternative to the legacy code bases. To interface with the legacy systems while dealing with the much larger data demands, an application was developed for segmentation and batch processing.

Legacy system: data merging

In order to explain the segmentation solution, it is first necessary to understand the function of the legacy system. For this explanation, the merge process from a legacy system will be the focus.

This will be explained using generic terminology. We have a ‘main file’ and some number of ‘supporting files’ (sup files). The output of the legacy system is each record of the main file merged with the information found for it somewhere in the supporting files.
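
For illustration, a hypothetical main file and supporting file might look like the following. The column names and values are invented for this example; the only structural requirement is a shared id column.

main.csv            supA.csv
id,colA,colB        id,colC
001,aa,bb           001,xx
001,aa,bb           003,yy
003,aa,bb           004,zz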

Originally, when running this process using the legacy systems, the full data set could be imported and loaded into memory. The system could then be used to run the various data merges to calculate the final output. Using the full data sets was no problem at the time.
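
To make the original in-memory approach concrete, it was roughly equivalent to the following sketch. Pandas and the file names are assumed here purely for illustration; this is not the legacy system's actual code.

import pandas as pd

# Load the full data set into memory -- fine for the data volumes of the time.
main = pd.read_csv("main.csv", dtype=str)
supA = pd.read_csv("supA.csv", dtype=str)
supB = pd.read_csv("supB.csv", dtype=str)

# Enrich every main record with the matching information from the supporting files.
merged = main.merge(supA, on="id", how="left").merge(supB, on="id", how="left")
merged.to_csv("merged_output.csv", index=False)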

When the data is large this is no longer possible, hence the input files need to be segmented and processed in batches.

Solution

This section describes the solution developed to solve the problem outlined above. The focus will be on how the segmentation process works, and the two types of segmentation that can be performed as part of the application developed in this work.

Segmentation Process

There are two main methods of segmentation and batch processing used in this work.

  • Take the main file as-is and segment the supporting files. Known as Single Segmentation.
  • Segment the main file, then produce segmented supporting files for each main-file segment. Known as Multi Segmentation.

Single Segmentation

In the single segmentation process, the main file is used to segment the supporting files.

[Figure: Single Segmentation]
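
A minimal sketch of how single segmentation can be implemented, assuming CSV files keyed by an id column. The function name and signature here are illustrative and not the application's actual API.

import csv

def single_segment(main_path, sup_path, out_path, main_id_col=0, sup_id_col=0):
    """Keep only the supporting-file rows whose id appears in the main file."""
    # Collect the set of ids present in the main file (only the ids, not the rows).
    with open(main_path, newline="") as f:
        main_ids = {row[main_id_col] for row in csv.reader(f) if row}

    # Stream the supporting file and write out only the matching rows.
    with open(sup_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            if row and row[sup_id_col] in main_ids:
                writer.writerow(row)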

Multi Segmentation

In the multi segmentation process, the main file is first segmented. Then each of these segments is used to further segment each of the supporting files.

[Figure: Multi Segmentation]
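
Reusing single_segment from the sketch above, multi segmentation could be sketched as follows. Again this is illustrative only; the output naming scheme is an assumption.

import os

def multi_segment(main_chunk_paths, sup_paths, main_id_col=0, sup_id_col=0):
    """For every main-file segment, produce a matching segment of every supporting file."""
    for i, chunk_path in enumerate(main_chunk_paths, start=1):
        for sup_path in sup_paths:
            base, ext = os.path.splitext(sup_path)
            out_path = f"{base}.seg{i}{ext}"   # e.g. supA.seg1.csv (illustrative naming)
            single_segment(chunk_path, sup_path, out_path, main_id_col, sup_id_col)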

Assumptions

The segmentation process explained above assumes that the files are sorted by id. This is admittedly a naive approach to segmentation: if we cannot guarantee the main file is sorted by id, we could in fact end up with a segmentation which is exactly the same as the set of input files in terms of size and complexity. This would be bad for two reasons:

  • The processing spent producing the segmented output files would be wasted
  • The legacy code would still not cope with the segmented files, even in a batch processing scenario

As an example, consider a main file containing 4 unique ids, repeated throughout the file in an unsorted, uneven way. We could then segment the main file such that every segment contains all 4 ids. The subsequent segmentation of the supporting files would then produce no reduction in file size, and the output would still carry all the complexity of the input.
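
To make this concrete, suppose an unsorted main file is chunked into two segments and the ids land as follows (purely illustrative):

segment 1 ids: 001, 002, 003, 004
segment 2 ids: 002, 001, 004, 003

Every segment still references every id, so every segmented supporting file must still carry the rows for all 4 ids and is no smaller than the original supporting file.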

Software

This section outlines some interesting and novel aspects of the software application. The particular aspect of interest in this application was the ‘Chunking Algorithm’.

Chunking Algorithm

This function reads an input file and chunks it into separate files based on the numChunks variable and the number of rows in the input file.

Algorithm

This algorithm iterates over each line in the input file and writes the line to an output file, based on:

  • the number of chunked output files specified
  • the requirement that the complete set of rows for one id is not split across multiple files

The algorithm works by first calculating the number of rows in the input file. This is done by iterating over all lines in the file, which scales to large files because the file data is never held in memory. Using this number and the given numChunks variable, the number of rows per chunk can be calculated.

numRowsPerChunk = totalNumRows / numChunks
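
A minimal sketch of this first pass and calculation (illustrative; the file is streamed rather than loaded):

def count_rows(path):
    # Stream the file line by line; nothing is held in memory.
    with open(path) as f:
        return sum(1 for _ in f)

totalNumRows = count_rows("main.csv")          # example input path
numChunks = 3
numRowsPerChunk = totalNumRows // numChunks    # integer division; any remainder is handled later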

The input file is then iterated line by line, writing the rows to an output file. The remainder of a division calculation is used to determine when a new segment should begin, and thus whether a new output file should be created.

rowCounter % numRowsPerChunk

The remainder of the calculation is 0 if rowCounter is exactly divisible by numRowsPerChunk, which indicates that rowCounter has reached the end of the number of rows that should be added to a file.

Example of the remainder calculation

A simple example of the remainder calculation is shown below. The file has 6 rows and the number of chunks specified is 3, so the input file should be chunked into 3 files of 2 rows each (headers are not considered).

 numChunks       = 3
 totalNumRows    = 6
 numRowsPerChunk = 2

0
rowCounter(0) % numRowsPerChunk(2) = 0
NewFile 1
Write row to file num 1


1
rowCounter(1) % numRowsPerChunk(2) = 1
chunkFileCounter = 1
Write row to file num 1


2
rowCounter(2) % numRowsPerChunk(2) = 0
NewFile 2
Write row to file num 2


3
rowCounter(3) % numRowsPerChunk(2) = 1
Write row to file num 2


4
rowCounter(4) % numRowsPerChunk(2) = 0
 NewFile 3
Write row to file num 3


5
rowCounter(5) % numRowsPerChunk(2) = 1
Write row to file num 3

Uneven file distribution

The algorithm has a logical check to make sure that an uneven distribution of rows is still divided into the specified number of files. The if statement ensures that any extra rows are simply added to the last file: a new file is only created while the number of files does not exceed the number of chunks.

if not chunkFileCounter>numChunks
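
A small self-contained demonstration of this guard (illustrative only): with 7 rows and 3 chunks, the counter passes numChunks on the last row, so no fourth file is opened and the extra row stays in the third file.

numChunks = 3
totalNumRows = 7
numRowsPerChunk = totalNumRows // numChunks    # = 2

chunkFileCounter = 0
for rowCounter in range(totalNumRows):
    if rowCounter % numRowsPerChunk == 0:
        chunkFileCounter += 1
        if not chunkFileCounter > numChunks:
            print(f"row {rowCounter}: open seg{chunkFileCounter}")
    # once the counter exceeds numChunks, no new file is opened and rows
    # keep going to the last file that was created
    print(f"row {rowCounter}: write to seg{min(chunkFileCounter, numChunks)}")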

Preserve ids in single files

The algorithm also has a logical check to make sure that the set of rows for a single id is not split across multiple files. This helps ensure that all rows for a single patient id are kept in one file rather than being partially split among different files.

This works by comparing the value of the id column in the current row against the previous row. If they are the same, the algorithm does not split to a new file yet; if the current row's value is different, it is fine to split into a new file at that point. This feature is always enabled for file chunking with this method, as it is an important requirement for this application.

Pseudocode

The pseudocode below combines all of the concepts explained above into a single algorithm.

For each row in input

    currentIdValue = row[columnId]

    if rowCounter % numRowsPerChunk == 0:
        chunkFileCounter = chunkFileCounter + 1

        if not chunkFileCounter > numChunks:
            createNewFileFlag = True

    if createNewFileFlag:

        # if the current and previous ids are not equal then this is a new person,
        #    and if we are not exceeding the maximum number of file chunks,
        #    then create a new file
        if (not previousIdValue == currentIdValue) and (not chunkFileCounter > numChunks):

            currentFileName = update file name using chunkFileCounter
            currentFileWriter = getNewFileWriter(currentFileName)
            createNewFileFlag = False

    currentFileWriter.writerow(row)
    rowCounter = rowCounter + 1
    previousIdValue = currentIdValue
Endfor
Function output

This function returns the list of chunkedFileNames as its output. This list is then used for the further segmentation processes in the main method of the application.
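
For reference, a runnable sketch of the whole chunking algorithm is shown below. It follows the pseudocode above but is not the application's exact code; details such as header handling are simplified.

import csv
import os

def chunk_file(input_path, num_chunks, id_col=0):
    """Illustrative re-implementation of the chunking algorithm described above.

    Splits input_path into roughly num_chunks CSV files of similar size while
    never splitting the rows for a single id across two output files.
    Returns the list of chunked file names.
    """
    # First pass: count the rows without holding the file in memory.
    with open(input_path) as f:
        total_num_rows = sum(1 for _ in f)
    num_rows_per_chunk = max(1, total_num_rows // num_chunks)

    base, ext = os.path.splitext(input_path)
    chunked_file_names = []

    chunk_file_counter = 0
    create_new_file_flag = False
    previous_id_value = None
    current_handle = None
    current_writer = None
    row_counter = 0

    # Second pass: stream the rows into the segment files.
    with open(input_path, newline="") as fin:
        for row in csv.reader(fin):
            if not row:
                continue
            current_id_value = row[id_col]

            # A chunk boundary has been reached.
            if row_counter % num_rows_per_chunk == 0:
                chunk_file_counter += 1
                if not chunk_file_counter > num_chunks:
                    create_new_file_flag = True

            if create_new_file_flag:
                # Only start the new file once the id changes, so that one id is
                # never split across two files, and never exceed num_chunks files.
                if previous_id_value != current_id_value and not chunk_file_counter > num_chunks:
                    if current_handle is not None:
                        current_handle.close()
                    current_file_name = f"{base}.seg{chunk_file_counter}{ext}"
                    current_handle = open(current_file_name, "w", newline="")
                    current_writer = csv.writer(current_handle)
                    chunked_file_names.append(current_file_name)
                    create_new_file_flag = False

            current_writer.writerow(row)
            row_counter += 1
            previous_id_value = current_id_value

    if current_handle is not None:
        current_handle.close()

    return chunked_file_names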

Examples

Below are some examples of the algorithm above in action on sample files.

Example A

All data should be written to one single file, i.e. no file chunks are created.

main
 chunkFile[data/testdata1/main.csv]
 numChunks       = 1
 totalNumRows    = 6
 numRowsPerChunk = 6
0
rowCounter(0) / numRowsPerChunk(6) = 0
rowCounter(0) % numRowsPerChunk(6) = 0
chunkFileCounter = 0
 NewFile
 chunkFileCounter = 1
Write row to file num = seg1.csv
['001', 'aa', 'bb']

1
rowCounter(1) / numRowsPerChunk(6) = 0
rowCounter(1) % numRowsPerChunk(6) = 1
chunkFileCounter = 1
Write row to file num = seg1.csv
['001', 'aa', 'bb']

2
rowCounter(2) / numRowsPerChunk(6) = 0
rowCounter(2) % numRowsPerChunk(6) = 2
chunkFileCounter = 1
Write row to file num = seg1.csv
['003', 'aa', 'bb']

3
rowCounter(3) / numRowsPerChunk(6) = 0
rowCounter(3) % numRowsPerChunk(6) = 3
chunkFileCounter = 1
Write row to file num = seg1.csv
['003', 'aa', 'bb']

4
rowCounter(4) / numRowsPerChunk(6) = 0
rowCounter(4) % numRowsPerChunk(6) = 4
chunkFileCounter = 1
Write row to file num = seg1.csv
['004', 'aa', 'bb']

5
rowCounter(5) / numRowsPerChunk(6) = 0
rowCounter(5) % numRowsPerChunk(6) = 5
chunkFileCounter = 1
Write row to file num = seg1.csv
['004', 'aa', 'bb']

Example B

Data should be written to 2 files: the 6 input lines should be chunked across two files, seg1 and seg2.

main
 chunkFile[data/testdata1/main.csv]
 numChunks       = 2
 totalNumRows    = 6
 numRowsPerChunk = 3
0
rowCounter(0) / numRowsPerChunk(3) = 0
rowCounter(0) % numRowsPerChunk(3) = 0
chunkFileCounter = 0
 NewFile
 chunkFileCounter = 1
Write row to file num = seg1.csv
['001', 'aa', 'bb']

1
rowCounter(1) / numRowsPerChunk(3) = 0
rowCounter(1) % numRowsPerChunk(3) = 1
chunkFileCounter = 1
Write row to file num = seg1.csv
['001', 'aa', 'bb']

2
rowCounter(2) / numRowsPerChunk(3) = 0
rowCounter(2) % numRowsPerChunk(3) = 2
chunkFileCounter = 1
Write row to file num = seg1.csv
['003', 'aa', 'bb']

3
rowCounter(3) / numRowsPerChunk(3) = 1
rowCounter(3) % numRowsPerChunk(3) = 0
chunkFileCounter = 1
 NewFile
 chunkFileCounter = 2
Write row to file num = seg2.csv
['003', 'aa', 'bb']

4
rowCounter(4) / numRowsPerChunk(3) = 1
rowCounter(4) % numRowsPerChunk(3) = 1
chunkFileCounter = 2
Write row to file num = seg2.csv
['004', 'aa', 'bb']

5
rowCounter(5) / numRowsPerChunk(3) = 1
rowCounter(5) % numRowsPerChunk(3) = 2
chunkFileCounter = 2
Write row to file num = seg2.csv
['004', 'aa', 'bb']

Example C

The number of input lines is 6 and the number of chunks specified is 3, so the 6 lines should be chunked across three files, seg1, seg2 and seg3, with 2 lines each.

main
 chunkFile[data/testdata1/main.csv]
 numChunks       = 3
 totalNumRows    = 6
 numRowsPerChunk = 2
0
rowCounter(0) / numRowsPerChunk(2) = 0
rowCounter(0) % numRowsPerChunk(2) = 0
chunkFileCounter = 0
 chunkFileCounter = 1
 NewFile
data/testdata1/main.seg1.csv
Write row to file num = data/testdata1/main.seg1.csv
['001', 'aa', 'bb']

1
rowCounter(1) / numRowsPerChunk(2) = 0
rowCounter(1) % numRowsPerChunk(2) = 1
chunkFileCounter = 1
Write row to file num = data/testdata1/main.seg1.csv
['001', 'aa', 'bb']

2
rowCounter(2) / numRowsPerChunk(2) = 1
rowCounter(2) % numRowsPerChunk(2) = 0
chunkFileCounter = 1
 chunkFileCounter = 2
 NewFile
data/testdata1/main.seg2.csv
Write row to file num = data/testdata1/main.seg2.csv
['003', 'aa', 'bb']

3
rowCounter(3) / numRowsPerChunk(2) = 1
rowCounter(3) % numRowsPerChunk(2) = 1
chunkFileCounter = 2
Write row to file num = data/testdata1/main.seg2.csv
['003', 'aa', 'bb']

4
rowCounter(4) / numRowsPerChunk(2) = 2
rowCounter(4) % numRowsPerChunk(2) = 0
chunkFileCounter = 2
 chunkFileCounter = 3
 NewFile
data/testdata1/main.seg3.csv
Write row to file num = data/testdata1/main.seg3.csv
['004', 'aa', 'bb']

5
rowCounter(5) / numRowsPerChunk(2) = 2
rowCounter(5) % numRowsPerChunk(2) = 1
chunkFileCounter = 3
Write row to file num = data/testdata1/main.seg3.csv
['004', 'aa', 'bb']

Example D

The number of input lines is 7 and the number of chunks specified is 3. This means we should have 3 files with 2 items each, and the one remaining item is placed in the last file, which will therefore contain 3 items. The 7 lines should be chunked across three files, seg1, seg2 and seg3.

main
 chunkFile[data/testdata1/main.csv]
 numChunks       = 3
 totalNumRows    = 7
 numRowsPerChunk = 2 ~ totalNumRows not exactly divisable by numChunks. remainders added to last file.
0
rowCounter(0) / numRowsPerChunk(2) = 0
rowCounter(0) % numRowsPerChunk(2) = 0
chunkFileCounter = 0
 chunkFileCounter = 1
 NewFile
data/testdata1/main.seg1.csv
Write row to file num = data/testdata1/main.seg1.csv
['001', 'aa', 'bb']

1
rowCounter(1) / numRowsPerChunk(2) = 0
rowCounter(1) % numRowsPerChunk(2) = 1
chunkFileCounter = 1
Write row to file num = data/testdata1/main.seg1.csv
['001', 'aa', 'bb']

2
rowCounter(2) / numRowsPerChunk(2) = 1
rowCounter(2) % numRowsPerChunk(2) = 0
chunkFileCounter = 1
 chunkFileCounter = 2
 NewFile
data/testdata1/main.seg2.csv
Write row to file num = data/testdata1/main.seg2.csv
['003', 'aa', 'bb']

3
rowCounter(3) / numRowsPerChunk(2) = 1
rowCounter(3) % numRowsPerChunk(2) = 1
chunkFileCounter = 2
Write row to file num = data/testdata1/main.seg2.csv
['003', 'aa', 'bb']

4
rowCounter(4) / numRowsPerChunk(2) = 2
rowCounter(4) % numRowsPerChunk(2) = 0
chunkFileCounter = 2
 chunkFileCounter = 3
 NewFile
data/testdata1/main.seg3.csv
Write row to file num = data/testdata1/main.seg3.csv
['004', 'aa', 'bb']

5
rowCounter(5) / numRowsPerChunk(2) = 2
rowCounter(5) % numRowsPerChunk(2) = 1
chunkFileCounter = 3
Write row to file num = data/testdata1/main.seg3.csv
['004', 'aa', 'bb']

6
rowCounter(6) / numRowsPerChunk(2) = 3
rowCounter(6) % numRowsPerChunk(2) = 0
chunkFileCounter = 3
 chunkFileCounter = 4
Write row to file num = data/testdata1/main.seg3.csv
['005', 'aa', 'bb']

Usage

There is some test data and a test configuration file included in the repository for this software.

The data files included are:

  • main.csv
  • supA.csv
  • supB.csv
  • supC.csv

The config file:

[mainFile]
dir=data/testdata1
mainFileName = %(dir)s/main.csv
mainFileColumnId = 0
numChunks = 3

[supportingFiles]
dir=data/testdata1

supportingFileNames = %(dir)s/supA.csv
		      %(dir)s/supB.csv
  		      %(dir)s/supC.csv

supportingFileColumnIds = 0
			  0
			  0

It is possible to use this program in two modes. These modes are switched using the ‘numChunks’ parameter in the mainFile config section. If the parameter is not supplied, the main input file is not chunked. If a value is supplied, that number is used to split the main input file into equal segments.
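
For example, a hypothetical variant of the config above with the numChunks line removed would run the program in the no-chunking mode:

[mainFile]
dir=data/testdata1
mainFileName = %(dir)s/main.csv
mainFileColumnId = 0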

Example command used to execute the program

> python FileSegmentationApp.py testdata1/config.ini

Code sample available: FileSegmentationForBatchJobs.

Program Time and Memory Complexity Analysis

(See http://bigocheatsheet.com/ for background on complexity notation.)

Time Complexity:

The time complexity of the algorithm is approximately 2n, i.e. O(n): one pass of n to read the input and determine the size of the file, and a second pass of n to iterate over the file and create the segments.

Possible ways to reduce this:

  • Use a threshold value for the input file size. This could be an estimate based on the input file size in bytes, which can be accessed without reading the lines of the file (see the sketch after this list).
  • Could the input file be iterated fewer than n times? Possibly with a divide-and-conquer style approach: start at the middle of the file, then determine the segments from there. Not a worked-out solution, just an idea.
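
As a sketch of the first idea (illustrative only, not implemented in the application): the row count could be estimated from the file size in bytes and the average length of a small sample of lines, avoiding the first full pass.

import os

def estimate_total_rows(path, sample_lines=1000):
    """Estimate the number of rows from the file size and a small sample of line lengths."""
    file_size = os.path.getsize(path)   # available without reading the file contents
    with open(path, "rb") as f:
        sample = [len(line) for _, line in zip(range(sample_lines), f)]
    if not sample:
        return 0
    avg_line_bytes = sum(sample) / len(sample)
    return int(file_size / avg_line_bytes)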

Memory: the memory footprint is low, but I/O and processing are high. This decision was made in order to scale to large input file sizes, i.e.:

  • The input files are never stored in memory.
  • The output data is flushed periodically and not stored in memory. This requires more processing and I/O, but is scalable to large file sizes.
