Author missed due to HTML encoded characters #4760

@mrombout

Description

ScanCode misses the author Ceki Gülcü in slf4j-api/src/main/java/org/slf4j/LoggerFactory.java. It detects the two other authors correctly, so the HTML-encoded &uuml; characters seem to be the problem.

How To Reproduce

Create a file LoggerFactory.java with the following content:

/**
 * The <code>LoggerFactory</code> is a utility class producing Loggers for
 * various logging APIs, e.g. logback, reload4j, log4j and JDK 1.4 logging.
 * Other implementations such as {@link org.slf4j.helpers.NOPLogger NOPLogger} and
 * SimpleLogger are also supported.
 *
 * <p><code>LoggerFactory</code>  is essentially a wrapper around an
 * {@link ILoggerFactory} instance provided by a {@link SLF4JServiceProvider}.
 *
 * <p>
 * Please note that all methods in <code>LoggerFactory</code> are static.
 *
 * @author Alexander Dorokhine
 * @author Robert Elliot
 * @author Ceki G&uuml;lc&uuml;
 *
 */

Then run a copyright scan on it and inspect the result:

$ ./scancode --copyright --json LoggerFactory.json LoggerFactory.java
$ cat LoggerFactory.json | jq
{
  "headers": [
    {
      "tool_name": "scancode-toolkit",
      "tool_version": "v32.5.0-5-g022ddc859b",
      "options": {
        "input": [
          "LoggerFactory.java"
        ],
        "--copyright": true,
        "--json": "LoggerFactory.json"
      },
      "notice": "Generated with ScanCode and provided on an \"AS IS\" BASIS, WITHOUT WARRANTIES\nOR CONDITIONS OF ANY KIND, either express or implied. No content created from\nScanCode should be considered or used as legal advice. Consult an Attorney\nfor any legal advice.\nScanCode is a free software code scanning tool from nexB Inc. and others.\nVisit https://github.com/nexB/scancode-toolkit/ for support and download.",
      "start_timestamp": "2026-02-18T112250.368774",
      "end_timestamp": "2026-02-18T112250.445016",
      "output_format_version": "4.1.0",
      "duration": 0.07626080513000488,
      "message": null,
      "errors": [],
      "warnings": [],
      "extra_data": {
        "system_environment": {
          "operating_system": "linux",
          "cpu_architecture": "64",
          "python_version": "3.12.3 (main, Jan 22 2026, 20:57:42) [GCC 13.3.0]"
        },
        "spdx_license_list_version": "3.27",
        "files_count": 1
      }
    }
  ],
  "files": [
    {
      "path": "LoggerFactory.java",
      "type": "file",
      "copyrights": [],
      "holders": [],
      "authors": [
        {
          "author": "Alexander Dorokhine",
          "start_line": 13,
          "end_line": 13
        },
        {
          "author": "Robert Elliot",
          "start_line": 14,
          "end_line": 14
        }
      ],
      "scan_errors": []
    }
  ]
}
$ cat LoggerFactory.json | jq ".files[].authors"

It only finds the following two authors; Ceki Gülcü is missing.

[
  {
    "author": "Alexander Dorokhine",
    "start_line": 13,
    "end_line": 13
  },
  {
    "author": "Robert Elliot",
    "start_line": 14,
    "end_line": 14
  }
]
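
To confirm the gap from a script rather than by eye, the detected authors can be compared with the ones expected from the Javadoc. This is a minimal Python sketch that assumes only the JSON layout shown above (`files[].authors[].author`); the helper name `missing_authors` is hypothetical and not part of ScanCode.

```python
import json


def missing_authors(scan_json: str, expected: list[str]) -> list[str]:
    """Return the expected author names that do not appear in a scan result.

    `scan_json` is the text of a ScanCode JSON report; only the
    files[].authors[].author fields are read.
    """
    data = json.loads(scan_json)
    found = {
        entry["author"]
        for file_entry in data.get("files", [])
        for entry in file_entry.get("authors", [])
    }
    # Preserve the order of `expected` so the report is easy to read.
    return [name for name in expected if name not in found]
```

Calling it with the contents of LoggerFactory.json and the three names from the Javadoc would list any author the scan did not report.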

If I modify the file to replace &uuml; with ü, the author is detected, although the ü is normalized to a plain u.

  {
    "author": "Geci Gulcu",
    "start_line": 15,
    "end_line": 15
  }
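
One possible direction, sketched here in Python and not taken from ScanCode's code base, would be to decode HTML entities before author detection runs. `html.unescape` from the standard library turns `&uuml;` into `ü`. The second helper strips diacritics through Unicode decomposition, which yields the same kind of u-for-ü result as above; whether ScanCode normalizes text this way is an assumption. Both function names are hypothetical.

```python
import html
import unicodedata


def decode_entities(line: str) -> str:
    """Replace HTML character references such as &uuml; with their characters."""
    return html.unescape(line)


def strip_diacritics(text: str) -> str:
    """Drop combining marks after NFKD decomposition, so that ü becomes u."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


line = " * @author Ceki G&uuml;lc&uuml;"
decoded = decode_entities(line)         # entities replaced by ü
ascii_like = strip_diacritics(decoded)  # ü reduced to u
```

With the entities decoded first, the third `@author` line has the same shape as the manually edited one that was detected.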

System configuration

  • What OS are you running on? Linux
  • What version of scancode-toolkit was used to generate the scan file? v32.5.0-5-g022ddc859b
  • What installation method was used to install/run scancode? Source
