We had to split each image into overlapping 256×256 px tiles. To balance the dataset, each image was divided into about 1600 tiles without cracks and 150 tiles with cracks.
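
The tiling step can be sketched as follows. This is a minimal illustration, not the code used in the project: the function name `split_into_tiles` and the stride of 128 px (a 50% overlap) are assumptions, since the text does not state how much the tiles overlap.

```python
import numpy as np


def split_into_tiles(image, tile=256, stride=128):
    """Cut an image array (H x W or H x W x C) into overlapping square tiles.

    `stride` is a hypothetical value: any stride smaller than `tile`
    produces overlapping tiles. Pixels in a bottom or right remainder
    that does not fit a full tile are dropped in this sketch.
    """
    height, width = image.shape[:2]
    tiles = []
    for top in range(0, height - tile + 1, stride):
        for left in range(0, width - tile + 1, stride):
            tiles.append(image[top:top + tile, left:left + tile])
    return tiles


# A 512x512 image with a 128 px stride gives a 3 x 3 grid of tiles.
tiles = split_into_tiles(np.zeros((512, 512)))
print(len(tiles))  # 9
```

A smaller stride yields more tiles per image at the cost of more redundancy between neighbouring samples.
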
We evaluated the model with the Sørensen–Dice coefficient. After training, it reached 95% on the training set and 93% on the validation set.
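
For two binary masks A (prediction) and B (ground truth), the Sørensen–Dice coefficient is 2|A ∩ B| / (|A| + |B|). The sketch below shows one way to compute it with NumPy. The function name `dice_coefficient` is hypothetical, and returning 1.0 when both masks are empty is a convention assumed here, not something stated in the text.

```python
import numpy as np


def dice_coefficient(pred, target):
    """Sørensen–Dice coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(2.0 * intersection / total)


pred = np.array([[1, 1], [0, 0]])
target = np.array([[1, 0], [0, 0]])
score = dice_coefficient(pred, target)  # 2 * 1 / (2 + 1)
```

A network that outputs probabilities would need thresholding (for example at 0.5, another assumption) before its predictions are passed to this function.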