Is there some built-in utility method to sanitize ...
# getting-started
v
Is there some built-in utility method to sanitize a String for usage in a File name/path?
h
The nuclear option is Base64.getUrlEncoder()… Other than that, I don't know of any.
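For reference, the Base64 route could look roughly like the sketch below. The helper names `toBase64FileName` and `fromBase64FileName` are made up for illustration; only `java.util.Base64` comes from the JDK. Dropping the padding is an assumption, made so that `=` never shows up in the name.

```kotlin
import java.util.Base64

// Hypothetical helper: encodes arbitrary text with the URL-safe Base64 alphabet
// (A-Z, a-z, 0-9, '-', '_'), so the result contains no path separators.
fun toBase64FileName(input: String): String =
    Base64.getUrlEncoder().withoutPadding().encodeToString(input.toByteArray(Charsets.UTF_8))

// Hypothetical helper: reverses toBase64FileName and recovers the original text.
fun fromBase64FileName(name: String): String =
    String(Base64.getUrlDecoder().decode(name), Charsets.UTF_8)
```

The result is reversible but not readable. Base64 is also case-sensitive, so two different inputs could still clash on a case-insensitive file system.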
v
... while staying readable I should have said 😄
What I do now is
"$foo".replace("""\W++""".toRegex(), "_")
j
Doesn't this rule out a bunch of valid things, like `.` or `-`?
Also, am I about to learn something, or is the second `+` unnecessary?
r
Depending on your definition of readable https://docs.oracle.com/javase/8/docs/api/java/net/URLEncoder.html#encode-java.lang.String-java.lang.String- is probably close to what you want
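A minimal sketch of the `URLEncoder` suggestion, assuming UTF-8 as the charset. `urlEncodedFileName` is a hypothetical name, not something from the JDK.

```kotlin
import java.net.URLEncoder

// Hypothetical helper: percent-encodes a string so it can serve as a single file name.
// Letters, digits, '.', '-', '*' and '_' pass through unchanged; a space becomes '+'.
fun urlEncodedFileName(input: String): String =
    URLEncoder.encode(input, "UTF-8")
```

With this, `urlEncodedFileName("../test 123")` gives `..%2Ftest+123`. Two things to watch: `*` is left as-is, which Windows does not accept in file names, and an input of just `.` or `..` comes back unchanged.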
v
Doesn't this rule out a bunch of valid things
Yes, it is just the ultra-nuclear-option. 😄
Also, am I about to learn something, or is the second `+` unnecessary?
Actually, as it is at the end of the regex, it is indeed unnecessary here.
j
Could you please elaborate? In which case is `++` different from a single `+`?
v
It is possessive, so it prevents backtracking. That means it can fail fast on a no-match if used properly. Say you have the String "0001". If you match with `0+2`, the regex engine matches `000`, then sees 1 is not 2. Then it backtracks and matches `00`, then sees 0 is not 2. Then it backtracks and matches `0`, then sees 0 is not 2. Now there is no way to backtrack further and the match fails. If you match with `0++2`, it matches `000` and sees 1 is not 2. As `++` is possessive, it cannot backtrack into the match, and the match fails immediately.
But if used wrongly, you can get false negatives. If you for example match `0++01` against `0001`, you will not get a match either, as the `0++` possessively matches the `000` and the next character is 1, not 0. So the match fails, while with `0+01` it would have succeeded. Simplified: if the thing you repeat cannot appear directly after it, you can make the quantifier possessive to get the fail-fast behavior.
I can recommend https://www.regular-expressions.info/. It is a great resource when it comes to regexes and also subtle differences between the many dialects.
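The four cases from the explanation above can be checked on the JVM with a small sketch. `foundIn` and `possessiveDemo` are hypothetical names; Kotlin's `Regex` on the JVM delegates to `java.util.regex`, which supports possessive quantifiers.

```kotlin
// Hypothetical helper: true if the pattern matches anywhere in the input.
fun foundIn(pattern: String, input: String): Boolean =
    Regex(pattern).containsMatchIn(input)

// Runs each pattern from the explanation against "0001".
fun possessiveDemo(): Map<String, Boolean> =
    listOf(
        "0+2",   // greedy, backtracks through 000, 00, 0 and then fails -> false
        "0++2",  // possessive, fails without backtracking             -> false
        "0+01",  // greedy, gives back one 0 so that "01" can match    -> true
        "0++01"  // possessive, keeps all zeros, so "01" cannot match  -> false
    ).associateWith { foundIn(it, "0001") }
```

The outcome is the same for `0+2` and `0++2`; the difference is only how much work the engine does before giving up.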
j
TIL. I knew about greedy (the default) vs reluctant (`?`), but didn't know about this third possibility. Thanks.
👌 1
v
Yeah, not all dialects have them
JS for example does not have it
👌 1
a
Is there some built-in utility method to sanitize a String for usage in a File name/path?
sanitize as in ‘make look pretty’ or also check to make sure the path is absolute? E.g. is `"../test 123"` invalid because it’s relative, or would it be prettified to `"___test_123"`?
c
invalid because it’s relative
“invalid” is situational here - a relative path is a valid path from the OS’ perspective; it may not be valid for your use case. Ditto for HTTP URLs: /something/../whatever is a valid HTTP URL path - but your app may not permit relative paths (I always prevent relative paths by default here for security reasons).
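If the use case is "reject anything that escapes a base directory", one possible sketch is to resolve the input against the base and compare the normalized result. `resolveInside` is a hypothetical helper and the policy it encodes is an assumption; only `java.nio.file.Path` and its methods come from the JDK.

```kotlin
import java.nio.file.Path

// Hypothetical helper: resolves userInput against baseDir and returns the resulting
// path only if it stays inside baseDir; returns null otherwise.
fun resolveInside(baseDir: Path, userInput: String): Path? {
    val base = baseDir.toAbsolutePath().normalize()
    // normalize() collapses "." and ".." segments without touching the file system
    val candidate = base.resolve(userInput).normalize()
    // Path.startsWith compares whole path components, not string prefixes
    return if (candidate.startsWith(base)) candidate else null
}
```

Under these assumptions `"../test 123"` is rejected, while `"a/../b"` is accepted because it normalizes to a path inside the base directory. An absolute input is rejected too, since `resolve` returns it unchanged. Symbolic links are not followed here.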
v
In my case I meant getting out something that can be used as file-name
a
yes that’s right, it’s situational. I wanted to ask because of course `.` and `/` are valid path characters, but in some contexts they’re not permissible
👍 1
v
For now I just use
"$foo".replace("""[^\w.-]++""".toRegex(), "_").removeSuffix("_")
a
on JVM there’s URLEncoder, but that’s not going to look as pretty as what you have. That’s the only built-in thing I can think of…
v
Yeah, was already suggested above, thanks.
r
Another relevant question is whether your code may ever be used by people who don't primarily use the Latin alphabet. If so, you'll quickly have problems with turning everything outside A-Z into `_`.
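To illustrate the point: on the JVM, `\w` only covers `[a-zA-Z_0-9]` by default, so non-ASCII letters get replaced. The sketch below puts the pattern quoted in the thread next to a Unicode-aware variant. The second pattern and both function names are assumptions for illustration, not something proposed in the thread.

```kotlin
// The pattern quoted in the thread: keeps ASCII word characters, '.' and '-'.
fun sanitizeAscii(input: String): String =
    input.replace(Regex("""[^\w.-]++"""), "_").removeSuffix("_")

// Assumed variant: keeps any Unicode letter (\p{L}) or number (\p{N}) as well.
fun sanitizeUnicode(input: String): String =
    input.replace(Regex("""[^\p{L}\p{N}_.-]++"""), "_").removeSuffix("_")
```

For example, `sanitizeAscii("Grüße.txt")` gives `Gr_e.txt`, while `sanitizeUnicode("Grüße.txt")` leaves the name untouched. Whether the target file system handles such names well is a separate question.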
v
true 😄 but quite unlikely for the task at hand, but thanks for noting
j
Something to consider as well is that there is a high chance that 2 different paths end up identical through this regex transformation, especially since you're replacing entire sequences of "bad" characters by a single `_`. Depending on your case, it may or may not be a problem. The `URLEncoder` approach doesn't suffer from this problem.
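A small sketch of the collision, using the regex quoted earlier in the thread and `URLEncoder` with UTF-8 as an assumed charset. The function names are hypothetical.

```kotlin
import java.net.URLEncoder

// The regex-based replacement from the thread: lossy, runs of "bad" characters collapse.
fun sanitize(input: String): String =
    input.replace(Regex("""[^\w.-]++"""), "_").removeSuffix("_")

// Percent-encoding: reversible, so distinct inputs stay distinct.
fun encode(input: String): String =
    URLEncoder.encode(input, "UTF-8")

// Counts how many distinct names each approach produces for the same inputs.
fun distinctNames(inputs: List<String>): Pair<Int, Int> =
    inputs.map(::sanitize).toSet().size to inputs.map(::encode).toSet().size
```

For the inputs `"a b"`, `"a/b"` and `"a // b"`, the regex version produces `a_b` three times, while the encoded names all differ.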
v
Yep, I'm aware of that and don't think so. And if it does happen, it will fail with "file exists already", then I can think about it. 😄
👍 1
a
You could use a property delegate, and use the property name as a file name. Even if escaped with backticks I think all property names would be valid? This would be complicated, but technically speaking it’s built-in
v
How would that help? I would have to set the property name, then I could as well just set the file name. The string comes from arbitrary input in a properties file.
r
Not sure if it's been mentioned, but what is valid for paths and files is also platform-dependent. For example, `:` is a valid character on Linux, but not on Windows. When I need to worry about platform-independent names, I like to limit them to the POSIX portable set, but how to sanitize is a different can of worms. For example, do you want to remove invalid characters/combinations or replace them?
(Depending on the use case, I'll often do a more restricted version where I don't allow `.`, but that really only makes sense when you control the extension.)
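As a rough sketch of the "limit to the POSIX portable set" idea: the portable filename character set is `A-Z`, `a-z`, `0-9`, `.`, `_` and `-`, and POSIX advises against a leading hyphen. The function below only validates and leaves the remove-or-replace decision open. `isPortableFileName` is a hypothetical name.

```kotlin
// Hypothetical helper: true if the name uses only the POSIX portable filename
// character set and does not start with a hyphen. An empty name is rejected.
fun isPortableFileName(name: String): Boolean =
    Regex("[A-Za-z0-9._-]+").matches(name) && !name.startsWith("-")
```

For the stricter variant without `.`, dropping the dot from the character class would be enough.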