It is consistently giving incorrectly low scores to white subjects and consistently giving incorrectly high scores to black subjects. That is clearly bias, at least in the colloquial sense.
The real question, though, is whether this is because the model is biased or because the two populations have different levels of risk. If group A had a higher risk than group B, then I would expect the model to have a higher rate of false positives for group A than for group B. This is simply because the model is more likely to (correctly) classify members of group A as high risk, and some of those classifications will be wrong. To check for bias you have to control for this base rate difference.
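To make the base rate point concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the Beta risk distributions, the 0.5 threshold, and the group labels are invented, and the "score" is simply each person's true risk, so it is perfectly calibrated and never looks at group membership. The error rates still differ between the groups.

```python
import random

def simulate_group(n, alpha, beta, rng, threshold=0.5):
    """Return (false_positive_rate, false_negative_rate) for one group.

    Each person's true risk p is drawn from Beta(alpha, beta) (an assumed
    distribution). The score is p itself, so it is calibrated by construction
    and ignores group membership entirely.
    """
    fp = fn = negatives = positives = 0
    for _ in range(n):
        p = rng.betavariate(alpha, beta)
        reoffends = rng.random() < p   # outcome drawn from the true risk
        flagged = p >= threshold       # classified as "high risk"
        if reoffends:
            positives += 1
            fn += not flagged          # reoffended but was rated low
        else:
            negatives += 1
            fp += flagged              # did not reoffend but was rated high
    return fp / negatives, fn / positives

rng = random.Random(0)  # fixed seed so the run is repeatable
fpr_a, fnr_a = simulate_group(20_000, 3, 3, rng)  # group A: mean risk 0.50
fpr_b, fnr_b = simulate_group(20_000, 2, 4, rng)  # group B: mean risk ~0.33
print(f"group A: FPR={fpr_a:.2f}  FNR={fnr_a:.2f}")
print(f"group B: FPR={fpr_b:.2f}  FNR={fnr_b:.2f}")
```

In this toy setup the higher-risk group ends up with the higher false positive rate and the lower false negative rate, even though the score treats both groups identically. It says nothing about whether the real data look like this; it only shows that unequal error rates alone do not establish a biased score.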
The degree to which it does this cannot be distinguished from random chance (p > 0.05).
If the predictor were biased then you could build a more accurate score based on both the original scores and race_factorBlack:score_factorHigh (and other interaction terms). I.e. you'd be building a new bias in to cancel the old bias, leaving an accurate predictor.
Their analysis doesn't show that this is possible.
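A rough sketch of the kind of check this implies, under stated assumptions: instead of refitting a regression with the interaction term, it uses a simpler stand-in, a two-proportion z-test asking whether people with the same score reoffend at different rates depending on group. The helper `two_proportion_z` and the simulated counts are hypothetical and are not taken from the analysis being discussed.

```python
import math
import random

def two_proportion_z(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions.

    x1/n1 and x2/n2 are reoffense counts over totals for two groups that
    received the same score. Returns (z, p_value). Assumes the pooled
    proportion is strictly between 0 and 1.
    """
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# Simulated data (an assumption, not real records): among people scored
# "High", both groups reoffend with the same probability, i.e. the score
# means the same thing for each group.
rng = random.Random(1)
n_high = 2_000
reoffend_a = sum(rng.random() < 0.6 for _ in range(n_high))
reoffend_b = sum(rng.random() < 0.6 for _ in range(n_high))

z_sim, p_sim = two_proportion_z(reoffend_a, n_high, reoffend_b, n_high)
print(f"z = {z_sim:.2f}, p = {p_sim:.3f}")
```

If a given score predicted different outcomes for the two groups, the two rates would separate and the p-value would shrink; that gap is what an interaction term such as race_factorBlack:score_factorHigh would pick up, and it is what a corrected predictor could exploit.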
A p < 0.05 cutoff is the kind of bar you would have to clear to get published in a peer-reviewed paper. Such a high bar of evidence is not necessary in this situation. To prevail in a civil suit, a person harmed by this algorithm would only have to prove that it is more likely than not that the algorithm is biased.