About the project

Better Diff is a modular, extensible, and scalable Java framework that finds differences between two or more text files and provides a list of modifications (additions, deletions, transpositions, partial / full mutations) for each pair, based on the full alignment. It naturally supports both baseless and hyparchetype textual criticism, as well as sequence alignment with an unlimited number of sub-alignment levels (verses, lines, words, letters etc.).

The output (apart from the framework itself) takes the form of special commands, so it can easily be used by non-Java clients, and it makes tasks such as assembling a critical edition easy to achieve.

It can also be used for sequence alignment of two or more nucleic acid or protein sequences. However, the alignment currently guarantees neither a local nor a global optimum.

betterdiff ~ $ java -jar betterdiff-lyrics-client.jar -S -f file1.txt -f file2.txt -f file3.txt --verses-weight 80 --lines-weight 60

Download: Libraries (Java 15+)

  • all: com.betterdiff.* (zip, includes executable client)
  • core: com.betterdiff.core.* (jar)
  • extensions: com.betterdiff.lyrics.* (jar)
  • utils: com.betterdiff.core.utils.* (jar), com.betterdiff.lyrics.utils.* (jar)
  • clients: com.betterdiff.lyrics.client.* (jar)

Documentation

Architecture

Core (com.betterdiff.core.*)
Utils (com.betterdiff.core.utils.*, com.betterdiff.<extension_name>.utils.*)
Extensions (com.betterdiff.<extension_name>.*)
Clients (com.betterdiff.<extension_name>.client.*)

The Core module provides the API for all phases, fields, steps and other elements of the whole process. Both Extensions and Utils should always use this API so that they stay compatible with every other extension or utility package.

It also provides an extensible framework for the Preparation and Pairing phases, and a full implementation of the Alignment and Identification phases.

The Core module does not depend on any other module within this framework.

Extensions are modules that extend the Core module for a specific purpose. They may or may not depend on other extensions, and they may or may not extend another extension.

They should always use the API from the Core module, even for classes that do not extend classes from the Core module. Otherwise they won't be generally compatible with other extensions or with Utils modules.

Extensions shouldn't contain any client-specific code and shouldn't implement a main method, so they cannot be run on their own.

Utils are modules that operate on the output of the Core module (be it Chunks, PairedChunks, or AlignedChunks) or on the Protocol itself.

They should provide text-based operations on the original texts and return the result in machine-readable form.

Utils should not depend on Extensions or on any other Utils module, but they may provide context-related operations for specific Extensions. This way they stay compatible with every other Utils module and may be used by alien Extensions where applicable. They may, however, depend on the Core Utils module if needed.

Utils shouldn't contain any client-specific code and shouldn't implement a main method, so they cannot be run on their own.

Clients are front-end applications that provide functionality to users. They can also act as middlemen for non-Java applications that want to use this framework. They can be thin front-end layers that give easy access to Utils methods, or full-blown desktop applications with their own architecture, multi-level modularity, extensions etc.

Clients are not expected to be compatible with each other, nor are they expected to be extensible or reusable.

Phases

Preparation -> Pairing -> Alignment -> Identification

The whole process consists of 4 phases. When files are compared, all of these phases should be processed in this order at least once. For some levels (see examples) some of the phases can be processed multiple times, but always in this order, because the output of Preparation is used by Pairing, the output of Pairing by Alignment, and the output of Alignment by Identification.

Preparation is the phase where chunks are identified. A chunk is a meaningful part of a text that will be aligned with other chunks. For example, in lyrics a chunk can be a whole verse, a line, or a word. For protein sequences, a chunk can be a single protein, a sub-sequence of proteins, or any other part of the whole sequence that should be aligned with other sequences.

Pairing is the phase where chunks are compared with each other. Chunks that have the same or similar content should receive the same id. What counts as the same or similar is left to the implementation.
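
To illustrate what "similar" could mean in a Pairing implementation, the sketch below scores two chunks by the length of their longest common subsequence (LCS) and lets them share an id when the score reaches a threshold. The class name, the scoring formula and the threshold of 70 are assumptions made for this illustration only. This document does not state how LCSPairing works internally or what its numeric constructor argument means.

```java
// Hypothetical sketch of an LCS-based similarity check between two chunks.
// Not part of the Better Diff API.
class LcsSimilaritySketch {

	// Length of the longest common subsequence of a and b (classic dynamic programming).
	static int lcsLength(String a, String b) {
		int[][] table = new int[a.length() + 1][b.length() + 1];
		for (int i = 1; i <= a.length(); i++) {
			for (int j = 1; j <= b.length(); j++) {
				if (a.charAt(i - 1) == b.charAt(j - 1)) {
					table[i][j] = table[i - 1][j - 1] + 1;
				} else {
					table[i][j] = Math.max(table[i - 1][j], table[i][j - 1]);
				}
			}
		}
		return table[a.length()][b.length()];
	}

	// Assumed score in percent: 100 for identical chunks, 0 when no character is shared.
	static int similarityPercent(String a, String b) {
		if (a.isEmpty() && b.isEmpty()) {
			return 100;
		}
		return 200 * lcsLength(a, b) / (a.length() + b.length());
	}

	// Assumed rule: two chunks get the same id when the score reaches the threshold.
	static boolean shouldShareId(String a, String b, int thresholdPercent) {
		return similarityPercent(a, b) >= thresholdPercent;
	}

	public static void main(String[] args) {
		System.out.println(similarityPercent("hello world", "hello word")); // 95
		System.out.println(shouldShareId("hello world", "hello word", 70)); // true
		System.out.println(shouldShareId("aaa", "ccc", 70));                // false
	}
}
```

A percentage score keeps the threshold independent of chunk length, which is one plausible reason to prefer it over a raw LCS length.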

Alignment is the phase where chunks are aligned to their final positions. The Core module uses the Falling algorithm (c) for this, but it can be extended or replaced by other algorithms that provide more accurate results (for example Smith-Waterman for nucleic acid sequences) or a different context-related alignment for alien Extensions.

Identification is the phase where mutations are identified. The Alignment implementation shouldn't affect this phase, but different modes (for example hyparchetype textual criticism) can alter the output.

Elements

Elements are data structures that are calculated inside phases and used for communication between them.

Falling language (Protocol)

The Protocol is a sequence of commands that leads from the original texts to the final alignment, including the identification of mutations. It can be used to reproduce the result without calculating all phases again. It can also be applied to different text sources with the same structure to reproduce the desired result, so it can formally act as a template.

text <ordinal_number>
Request a text on the input.
<ordinal_number> - Ordinal number of the input text.

chunk <ordinal_number>,[<start_index>,<end_index>] -> [<x_axis>,<y_axis>]
Identify a chunk of text bounded by start and end index in a given text and put the chunk in the alignment matrix.
<ordinal_number> - Ordinal number of the input text where the chunk has been identified.
<start_index> - Ordinal number of the character in the given text where the chunk starts. This character is included in the chunk.
<end_index> - Ordinal number of the character in the given text where the chunk ends. This character is included in the chunk.
<x_axis> - X position of the chunk in the alignment matrix.
<y_axis> - Y position of the chunk in the alignment matrix.

Notes.
Every whitespace character, including a line break (\n, \r\n, \r), is counted as 1 character.
Start index and end index are both 1-based and inclusive, which differs from routines like substring (0-based start, exclusive end).

Example:
This is a dog.
If we split the sentence into words we get 4 chunks:
chunk 1-4 (This)
chunk 6-7 (is)
chunk 9-9 (a)
chunk 11-13 (dog)
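
The index convention can be checked with a short sketch. It splits the example sentence into words, reports each word as a 1-based inclusive range, and converts such a range back into text with String.substring. The class and method names are hypothetical and not part of the framework. The sketch also assumes the text contains no \r\n sequences, since it counts every Java char as one character.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the chunk index convention.
// Not part of the Better Diff API.
class ChunkIndexSketch {

	// Returns the text of a chunk given 1-based, inclusive start and end indices.
	static String chunkText(String text, int startIndex, int endIndex) {
		// substring() takes a 0-based start and an exclusive end,
		// so only the start index has to be shifted by one.
		return text.substring(startIndex - 1, endIndex);
	}

	// Splits a text into words and returns {start, end} pairs (1-based, inclusive).
	static List<int[]> wordChunks(String text) {
		List<int[]> chunks = new ArrayList<>();
		int start = -1; // 0-based start of the current word, -1 when outside a word
		for (int i = 0; i <= text.length(); i++) {
			boolean inWord = i < text.length() && Character.isLetterOrDigit(text.charAt(i));
			if (inWord && start < 0) {
				start = i;
			} else if (!inWord && start >= 0) {
				chunks.add(new int[] { start + 1, i });
				start = -1;
			}
		}
		return chunks;
	}

	public static void main(String[] args) {
		String text = "This is a dog.";
		for (int[] chunk : wordChunks(text)) {
			System.out.println("chunk " + chunk[0] + "-" + chunk[1]
					+ " (" + chunkText(text, chunk[0], chunk[1]) + ")");
		}
		// Prints:
		// chunk 1-4 (This)
		// chunk 6-7 (is)
		// chunk 9-9 (a)
		// chunk 11-13 (dog)
	}
}
```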

match [<x_axis>,<y_axis>] -> <id>
Assign an ID to the specified position in the alignment matrix.
<x_axis> - X position of the chunk in the alignment matrix.
<y_axis> - Y position of the chunk in the alignment matrix.
<id> - Identification number of the chunk.

move <shift_size>,[<x_axis>,<y_axis>]
Move positions in the alignment matrix down by the given shift size. Only positions in the column specified by x_axis are moved, and only those at or below the row specified by y_axis.
<shift_size> - Total size of the performed shift.
<x_axis> - X position of the starting chunk in the alignment matrix.
<y_axis> - Y position of the starting chunk in the alignment matrix.

Example:
[Chunk 1] [Chunk 2]
[Chunk 3] [Chunk 4]
[Chunk 5] [Chunk 6]

Command: move 2,[1,2]

Result:
[Chunk 1] [Chunk 2]
[(empty)] [Chunk 4]
[(empty)] [Chunk 6]
[Chunk 3] [(empty)]
[Chunk 5] [(empty)]
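
The effect of move can be reproduced with a small model of the alignment matrix. The sketch below stores the matrix as a list of columns, uses null for an empty position, and assumes all columns have the same height before the call. This representation and the class name are assumptions for illustration and say nothing about how the Core module stores the matrix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the "move" command. Not part of the Better Diff API.
class MoveCommandSketch {

	// Applies "move <shiftSize>,[<x>,<y>]": in column x, every position at row y
	// or below moves down by shiftSize. Coordinates are 1-based.
	static void move(List<List<String>> columns, int shiftSize, int x, int y) {
		List<String> column = columns.get(x - 1);
		for (int i = 0; i < shiftSize; i++) {
			column.add(y - 1, null); // inserting an empty cell pushes the rest down
		}
		// Drop trailing empty cells, then pad all columns to a common height.
		int height = 0;
		for (List<String> c : columns) {
			while (!c.isEmpty() && c.get(c.size() - 1) == null) {
				c.remove(c.size() - 1);
			}
			height = Math.max(height, c.size());
		}
		for (List<String> c : columns) {
			while (c.size() < height) {
				c.add(null);
			}
		}
	}

	public static void main(String[] args) {
		List<List<String>> columns = new ArrayList<>();
		columns.add(new ArrayList<>(Arrays.asList("Chunk 1", "Chunk 3", "Chunk 5")));
		columns.add(new ArrayList<>(Arrays.asList("Chunk 2", "Chunk 4", "Chunk 6")));

		move(columns, 2, 1, 2);

		// Print the matrix row by row, in the notation of the example above.
		for (int row = 0; row < columns.get(0).size(); row++) {
			StringBuilder line = new StringBuilder();
			for (List<String> column : columns) {
				String cell = column.get(row);
				line.append(cell == null ? "[(empty)] " : "[" + cell + "] ");
			}
			System.out.println(line.toString().trim());
		}
	}
}
```

Running main prints the same five rows as the Result above.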

pick <shift_size>,[<x_axis>,<y_axis>]
Move a single position down by the given shift size.
<shift_size> - Total size of the performed shift.
<x_axis> - X position of the chunk in the alignment matrix.
<y_axis> - Y position of the chunk in the alignment matrix.

Example:
[Chunk 1] [Chunk 2]
[(empty)] [Chunk 4]
[Chunk 5] [Chunk 6]

Command: pick 1,[1,1]

Result:
[(empty)] [Chunk 2]
[Chunk 1] [Chunk 4]
[Chunk 5] [Chunk 6]
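
Unlike move, pick relocates only one position. The sketch below models this on a list of columns with null for an empty position. It assumes that the target position must be empty and throws otherwise. The source does not say what happens when the target is occupied, so that rule, the data layout and the class name are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the "pick" command. Not part of the Better Diff API.
class PickCommandSketch {

	// Applies "pick <shiftSize>,[<x>,<y>]": only the chunk at column x, row y
	// moves down by shiftSize. Coordinates are 1-based.
	static void pick(List<List<String>> columns, int shiftSize, int x, int y) {
		List<String> column = columns.get(x - 1);
		int from = y - 1;
		int to = from + shiftSize;
		while (column.size() <= to) {
			column.add(null); // grow the column if the target row does not exist yet
		}
		if (column.get(to) != null) {
			// Assumption: a chunk may only be picked into an empty position.
			throw new IllegalStateException("Target position is occupied");
		}
		column.set(to, column.get(from));
		column.set(from, null);
		// Keep all columns at the same height.
		for (List<String> other : columns) {
			while (other.size() < column.size()) {
				other.add(null);
			}
		}
	}

	public static void main(String[] args) {
		List<List<String>> columns = new ArrayList<>();
		columns.add(new ArrayList<>(Arrays.asList("Chunk 1", null, "Chunk 5")));
		columns.add(new ArrayList<>(Arrays.asList("Chunk 2", "Chunk 4", "Chunk 6")));

		pick(columns, 1, 1, 1);

		System.out.println(columns.get(0)); // [null, Chunk 1, Chunk 5]
		System.out.println(columns.get(1)); // [Chunk 2, Chunk 4, Chunk 6]
	}
}
```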

fin [<x_axis>,<y_axis>]
Mark the given position as finished. This means that the position has reached its final alignment and doesn't have to be aligned anymore.
<x_axis> - X position of the chunk in the alignment matrix.
<y_axis> - Y position of the chunk in the alignment matrix.

local
Change the scope of further alignments to the detail level. Be aware that there is no way to go back, so the alignment on the current level must be finished before going down a level.

row <sub_row_detail>
Change the row of alignment for the current level to the given row. The scope must be local first.
<sub_row_detail> - Row detail of the current level.

Example:
〚[Chunk 1] 〚[Chunk 2]
[Chunk 3]〛 [Chunk 4]〛
〚[Chunk 5] 〚[Chunk 6]
[Chunk 7]〛 [Chunk 8]〛

Commands:
move 1,[1,1]
local
row 2
move 1,[1,1]

Result:
〚[(empty)] 〚[Chunk 2]
[(empty)]〛 [Chunk 4]〛
〚[(empty)] 〚[Chunk 6]
[Chunk 1] [Chunk 8]
[Chunk 3]〛 [(empty)]〛
〚[Chunk 5] 〚[(empty)]
[Chunk 7]〛 [(empty)]〛

mut <mutation_type>,[<original_x_axis>,<original_y_axis>] x [<target_x_axis>,<target_y_axis>]
Mark a mutation between two chunks.
<mutation_type> - Mutation type, one of:
= - Equality
PM - Partial mutation
FM - Full mutation
T - Transposition
A - Addition
D - Deletion
Note. Please bear in mind that in the case of baseless comparison the mutations are symmetrical. In that case only one mutation is listed and the symmetrical one is omitted.
<original_x_axis> - X position of the original chunk in the alignment matrix.
<original_y_axis> - Y position of the original chunk in the alignment matrix.
<target_x_axis> - X position of the mutated chunk in the alignment matrix.
<target_y_axis> - Y position of the mutated chunk in the alignment matrix.
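
Because the Protocol is plain text, a non-Java or thin Java client could read mut commands with a regular expression. The sketch below formats and parses a single mut line according to the syntax above. The class names, the sample line and the tolerated whitespace are assumptions for illustration. The framework's own protocol classes may parse commands differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical reader/writer for a single "mut" command line.
// Not part of the Better Diff API.
class MutCommandSketch {

	// Simple value holder for one parsed mut command.
	static final class Mut {
		final String type;
		final int originalX, originalY, targetX, targetY;

		Mut(String type, int originalX, int originalY, int targetX, int targetY) {
			this.type = type;
			this.originalX = originalX;
			this.originalY = originalY;
			this.targetX = targetX;
			this.targetY = targetY;
		}
	}

	// mut <mutation_type>,[<x>,<y>] x [<x>,<y>]
	private static final Pattern MUT = Pattern.compile(
			"mut\\s+(=|PM|FM|T|A|D)\\s*,\\s*\\[\\s*(\\d+)\\s*,\\s*(\\d+)\\s*\\]"
			+ "\\s+x\\s+\\[\\s*(\\d+)\\s*,\\s*(\\d+)\\s*\\]");

	// Writes a command in the layout shown in the syntax line above.
	static String format(Mut mut) {
		return "mut " + mut.type + ",[" + mut.originalX + "," + mut.originalY + "] x ["
				+ mut.targetX + "," + mut.targetY + "]";
	}

	// Parses one command line, rejecting anything that is not a mut command.
	static Mut parse(String line) {
		Matcher matcher = MUT.matcher(line.trim());
		if (!matcher.matches()) {
			throw new IllegalArgumentException("Not a mut command: " + line);
		}
		return new Mut(matcher.group(1),
				Integer.parseInt(matcher.group(2)), Integer.parseInt(matcher.group(3)),
				Integer.parseInt(matcher.group(4)), Integer.parseInt(matcher.group(5)));
	}

	public static void main(String[] args) {
		Mut mut = parse("mut PM,[1,2] x [2,2]"); // made-up sample line
		System.out.println(mut.type + " from [" + mut.originalX + "," + mut.originalY
				+ "] to [" + mut.targetX + "," + mut.targetY + "]");
		System.out.println(format(mut)); // mut PM,[1,2] x [2,2]
	}
}
```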

How To

Example

		  
import java.util.List;
import java.util.logging.Level;

import com.betterdiff.core.Callback;
import com.betterdiff.core.alignment.AlignedChunk;
import com.betterdiff.core.alignment.Alignment;
import com.betterdiff.core.identification.Identification;
import com.betterdiff.core.preparation.Preparation;
import com.betterdiff.core.protocol.PartialProtocol;
import com.betterdiff.core.protocol.command.Mutation;
import com.betterdiff.lyrics.pairing.LCSPairing;

public class Example {

	private class MyCallback extends Callback {

		public MyCallback(Level level) {
			super(level);
		}
			
		@Override
		public void log(Level level, String message) {
			System.out.println("Callback: " + level.getName() + ", message: " + message);
		}
		
	}

	private class MyPreparation extends Preparation {
		
		public MyPreparation(Callback callback) {
			super(callback);
		}
		
		@Override
		protected void processText(String text, int ordinalNumber) {
			int currentPosition = 0;
			for (String line: text.split("\n")) {
				if (!line.isEmpty()) {
					super.addChunk(
						ordinalNumber,
						currentPosition + 1,
						currentPosition + line.length());	
				}
				
				currentPosition += line.length() + 1;
			}
		}

		@Override
		protected void processAllTexts(List<String> texts) {
			for (int i = 1; i <= texts.size(); i++) {
				this.processText(texts.get(i - 1), i);
			}
		}
		
	}
	
	public void example() {

		Callback callback = new MyCallback(Level.INFO);
		
		MyPreparation myPreparation = new MyPreparation(callback);
		myPreparation.addText("aaa\nbbb");
		myPreparation.addText("aaa\nccc");
		PartialProtocol preparationProtocol = myPreparation.process();
		
		LCSPairing lcsPairing = new LCSPairing(myPreparation, callback, 70);
		PartialProtocol lcsPairingProtocol = lcsPairing.process();
		
		Alignment alignment = new Alignment(lcsPairing, callback);
		PartialProtocol alignmentProtocol = alignment.process();
		
		Identification identification = new Identification(alignment, callback);
		PartialProtocol identificationProtocol = identification.process();
		
		// Aligned chunks in the matrix - this is the final alignment
		List<AlignedChunk> alignedChunks = alignment.getAlignedChunks();
		
		// Mutations between chunks in the aligned matrix
		List<Mutation> mutations = identification.getMutations();
	
	}
}
		  
		  

JavaDoc

com.betterdiff.core.*
com.betterdiff.lyrics.*
com.betterdiff.core.utils.*
com.betterdiff.lyrics.utils.*

License, Contact

Ladislav Asenbrener, troomar@gmail.com
License: CC BY-NC-ND 3.0
https://creativecommons.org/licenses/by-nc-nd/3.0/legalcode
https://creativecommons.org/licenses/by-nc-nd/3.0/