Conversation
|
@dpol1 Since you are already familiar with our codebase, feel free to provide feedback on this PR :) |
core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java
Outdated
Show resolved
Hide resolved
1. URLUtil.toURL(String) utility method
- Added URLUtil.toURL(String) that wraps new URI(url).toURL() and converts URISyntaxException into MalformedURLException —
callers no longer need to handle two exception types
- Fixed getHostSegments(String) to use toURL() and removed URISyntaxException from its signature
2. Pre-sanitization in BasicURLNormalizer.sanitizeForURI()
- Encodes common illegal URI characters before parsing: | → %7C, \ → %5C, space → %20, { → %7B, } → %7D
- Converts non-standard %uXXXX percent encoding to proper UTF-8 percent encoding (e.g., %u011f → %C4%9F)
- Applied at the start of filter() so all subsequent URI parsing succeeds
|
@rzo1 thanks for tagging me, appreciate that - the changes LGTM! However I was going through the documentation and noticed:
Might be worth a follow-up issue, or documenting it as a known behavior change. What do you think? |
|
Yes - that is the downside of the URL deprecation. A follow-up issue documenting this as a known behavior change seems the right call, with a note in the upgrade/migration guide pointing out that pre-existing URLs in the status index may need to be re-normalized (or that a sanitization pass on the seeds (store) may be advisable before upgrading). An alternative to handle that, could be to move the sanitization (from the normalizer) to |
indeed
Agreed, very clever issue handling - I hadn't thought of that. |
Add URLUtil.toURL() and URLUtil.toURI() as drop-in replacements for the deprecated new URL(String) constructor. Both use a two-phase approach: strict RFC 3986 parsing first, then fallback sanitization of common illegal characters (pipes, backslashes, spaces, curly braces, %uXXXX encoding) to preserve the leniency of the old URL constructor. Replace all new URI(url).toURL() calls across the codebase with URLUtil.toURL(). Update BasicURLNormalizer to sync its working string via toExternalForm() after parsing, eliminating the need for separate string sanitization.
|
This should now handle the said (hidden) regression for already persistent data. |
mvolikas
left a comment
There was a problem hiding this comment.
Looks good. I strongly agree on documenting the new behavior with concrete URL examples.
For all changes
Is there a issue associated with this PR? Is it referenced in the commit message?
Does your PR title start with
#XXXXwhereXXXXis the issue number you are trying to resolve?Has your PR been rebased against the latest commit within the target branch (typically main)?
Is your initial contribution a single, squashed commit?
Is the code properly formatted with
mvn git-code-format:format-code -Dgcf.globPattern="**/*" -Dskip.format.code=false?For code changes
mvn clean verify?