[ENH] V1 → V2 API Migration - datasets #1608
Conversation
Codecov Report ❌ Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1608      +/-   ##
==========================================
- Coverage   52.04%   51.47%   -0.58%
==========================================
  Files          36       63      +27
  Lines        4333     5366    +1033
==========================================
+ Hits         2255     2762     +507
- Misses       2078     2604     +526
```

☔ View full report in Codecov by Sentry.
FYI @geetu040 Currently the …

Issues: …

Example: the current

```python
def _get_dataset_features_file(did_cache_dir: str | Path | None, dataset_id: int) -> dict[int, OpenMLDataFeature]:
    ...
    return _features
```

Or by updating the Dataset class to use the underlying interface method from api_context directly:

```python
def _load_features(self) -> None:
    ...
    self._features = api_context.backend.datasets.get_features(self.dataset_id)
```

Another option is to add …
```python
def list(
    self,
    limit: int,
    offset: int,
    *,
    data_id: list[int] | None = None,  # type: ignore
    **kwargs: Any,
) -> pd.DataFrame: ...
```
Can we not have the same signature for all 3 methods: DatasetsAPI.list, DatasetsV1.list, DatasetsV2.list? Does it raise pre-commit failures, since a few parameters might not be used?
Oh, that v2 signature was experimental; I don't know how pre-commit did not catch that. Will make them the same.
Is mypy supposed to catch that?
Yes, unused parameters are caught under ruff's ARG001 rule, as seen with the cache_directory params.
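For illustration, a minimal sketch of the situation being discussed (the class, client attribute, and endpoint path are hypothetical; note ruff reports ARG001 for unused function arguments and ARG002 for unused method arguments):

```python
from typing import Any

import requests


class DatasetsV2:
    """Sketch of a v2 implementation keeping the shared signature."""

    def __init__(self, http_client: requests.Session) -> None:
        self._client = http_client

    def get_features(
        self,
        dataset_id: int,
        cache_directory: str | None = None,  # noqa: ARG002
    ) -> dict[str, Any]:
        # cache_directory is unused in v2 but kept so all implementations
        # share one signature; the noqa suppresses ruff's unused-argument
        # rule for this deliberately ignored parameter.
        return self._client.get(f"datasets/{dataset_id}/features").json()
```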
```python
def list(
    self,
    limit: int,
    offset: int,
    *,
    data_id: list[int] | None = None,  # type: ignore
    **kwargs: Any,
) -> pd.DataFrame:
```
You can make this simpler using private helper methods.
@geetu040 I could not find a good common block to extract into a helper, since the filters are passed via the URL in v1 and via a JSON object in v2, and both have different parsing. If you have any specific idea on that, please let me know.
Looking at DatasetsV1, I can think of these helper methods: `_build_list_call`, `_parse_and_validate_xml`, `_parse_dataset`.
You can do something similar for DatasetsV2, though they can be different.
> you can do something similar for DatasetsV2 though they can be different.

I see, that opens more options.
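A minimal sketch of how DatasetsV1.list could be decomposed along those lines; the three helper names come from the suggestion above, while the URL scheme, XML layout, and parsing details are assumptions:

```python
import pandas as pd
import xmltodict


class DatasetsV1:
    def __init__(self, http_client):
        self._client = http_client

    def list(self, limit: int, offset: int, **filters) -> pd.DataFrame:
        url = self._build_list_call(limit, offset, **filters)
        xml = self._client.get(url).text
        datasets = self._parse_and_validate_xml(xml)
        return pd.DataFrame([self._parse_dataset(d) for d in datasets])

    def _build_list_call(self, limit: int, offset: int, **filters) -> str:
        # v1 encodes filters as URL path segments, e.g. data/list/limit/10/offset/0
        parts = ["data/list", f"limit/{limit}", f"offset/{offset}"]
        parts += [f"{key}/{value}" for key, value in filters.items() if value is not None]
        return "/".join(parts)

    def _parse_and_validate_xml(self, xml: str) -> list[dict]:
        parsed = xmltodict.parse(xml)
        if "oml:data" not in parsed:
            raise ValueError('Error in return XML, does not contain "oml:data"')
        return parsed["oml:data"]["oml:dataset"]

    def _parse_dataset(self, raw: dict) -> dict:
        # Strip the "oml:" namespace prefix from each field.
        return {key.removeprefix("oml:"): value for key, value in raw.items()}
```

DatasetsV2 could mirror the same shape with a JSON body builder and a JSON parser instead.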
openml/_api/resources/datasets.py (outdated)

```python
        bool
            True if the deletion was successful. False otherwise.
        """
        return openml.utils._delete_entity("data", dataset_id)
```
If you implement the delete logic yourself instead of openml.utils._delete_entity, how would that look? I think it would be better.
Makes sense. It would look like a DELETE request from the client along with exception handling.
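A minimal sketch of what that could look like; the client attribute, the endpoint path, and the 412 status for a refused delete are assumptions (412 only appears later in this thread's test mocks):

```python
import requests


class DatasetsV1:
    def __init__(self, http_client):
        self._client = http_client

    def delete(self, dataset_id: int) -> bool:
        """Send a DELETE for the dataset; return True on success."""
        try:
            response = self._client.delete(f"data/{dataset_id}")
            response.raise_for_status()
        except requests.HTTPError as e:
            # Translate an expected server-side refusal into False,
            # re-raise anything unexpected.
            if e.response is not None and e.response.status_code == 412:
                return False
            raise
        return True
```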
```python
def list(
    self,
    limit: int,
    offset: int,
    **kwargs: Any,
) -> pd.DataFrame:
```
Same as above; it can use private helper methods.
openml/datasets/functions.py (outdated)

```python
    # Minimalistic check if the XML is useful
    if "oml:data_qualities_list" not in qualities:
        raise ValueError('Error in return XML, does not contain "oml:data_qualities_list"')
    from openml._api import api_context
```
Can't we have this import at the very top? Does it create a circular import error? If not, it should be moved to the top from all functions.
It does raise a circular import error.
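For context, a minimal sketch of the deferred-import pattern being discussed; the surrounding function and the backend method are hypothetical, following the snippets above:

```python
# openml/datasets/functions.py
# A module-level `from openml._api import api_context` would complete an
# import cycle (assumed here to run back through openml.datasets), so the
# import is deferred to call time inside each function that needs it.


def _get_qualities(dataset_id: int):
    from openml._api import api_context  # deferred to avoid circular import

    return api_context.backend.datasets.get_qualities(dataset_id)
```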
Thanks for the detailed explanation, I now have a good understanding of the download mechanism.

- minio can be handled easily, we will use a separate client along with …
- these are actually different objects in both APIs: v1 uses XML and v2 keeps them in JSON
- yes, you are right, they are the same files, which are not required to be downloaded again for both versions, but isn't this true for almost all the HTTP objects? they may have different formats
- I don't understand this point
- agreed, should be handled by …
- agreed, adding …

In conclusion, I may ask: if we ignore the fact that it downloads the …
@geetu040 making a new client for …

FYI the new commit adds better handling of features and qualities in the OpenMLDataset class, moving the v1-specific parsing logic to the interface. So the only part left is to handle …
From the standup discussion and earlier conversations, I think we can agree on a few points: …
Consider this a green light to experiment with the client design. Try an approach, use whatever caching strategy you think fits best, and aim for a clean, sensible design. Feel free to ask for suggestions or reviews along the way. I'll review it in code. Just make sure this doesn't significantly impact the base classes or other stacked PRs.
The points do make sense to me. I will propose the design along with how it would be used in the resource.
@geetu040 I have a design implemented which needs review.

Question: …
I have taken a quick look; the design looks really good, though I have some suggestions/questions in the code, which I will leave in a detailed review. But in general this fixes all our blockers without damaging the original design.
Is it provided by the user? I don't think so. In that case, how does it affect the users? From looking at the code, this cache directory is generated programmatically inside the functions; we can completely remove these lines and always rely on the …
This reverts commit fd43c48.
```python
def test__check_qualities():
```
Got it; since it's moved to the resource API, it should be tested at the resource class level then.
- removing this since it was not part of the sdk previously
- some tests fail because of the timeout in stacked PRs
- this option can easily be added if needed in future
openml/_api/resources/dataset.py (outdated)

```python
        original_data_url: str | None = None,
        paper_url: str | None = None,
    ) -> int:
        raise NotImplementedError(self._not_supported(method="edit"))
```
You can just use `self._not_supported(method="edit")`.
No need for `raise NotImplementedError()`.
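This implies `_not_supported` raises on its own; a minimal sketch of that pattern, with the exception type and message wording as assumptions:

```python
from typing import NoReturn


class DatasetsV2:
    def _not_supported(self, method: str) -> NoReturn:
        # Raising here lets implementations call this as a bare statement
        # instead of wrapping it in `raise NotImplementedError(...)`.
        raise NotImplementedError(f"`{method}` is not supported by this API version.")

    def edit(self, dataset_id: int) -> int:
        # NoReturn tells mypy this call never returns, so the `-> int`
        # annotation does not trigger a missing-return error.
        self._not_supported(method="edit")
```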
tests/test_api/test_datasets.py (outdated)

```python
class TestDatasetV1API(TestAPIBase):
    def setUp(self):
        super().setUp()
        self.client = self._get_http_client(
```
Since this is V1, using `self.client = self.http_client` will do.
Since the recent change, it can be used from `self.http_clients[APIVersion.V1]`.
Alright, I have made the change.
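For context, a minimal sketch of what the `self.http_clients` mapping referenced above could look like in the test base; the enum values and setup details are assumptions:

```python
from enum import Enum


class APIVersion(Enum):
    V1 = "v1"
    V2 = "v2"


class TestAPIBase:
    def setUp(self):
        # Build one HTTP client per API version up front so individual
        # test classes can grab the one they need.
        self.http_clients = {
            version: self._get_http_client(version) for version in APIVersion
        }

    def _get_http_client(self, version: APIVersion):
        ...  # construct a client pointed at the v1 or v2 endpoint
```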
geetu040 left a comment:
- can you please look at the failing test, is it related to your code? `tests/test_runs/test_run_functions.py::TestRun::test_format_prediction_non_supervised`
- maybe use the design from `tests/test_api/test_versions.py` if it helps
- your tests involve publishing something that should be deleted; for this cleanup you can use `TestBase._mark_entity_for_removal` (see the sketch below)
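A minimal sketch of that cleanup in a publishing test; the test body and the helper's exact signature are assumptions:

```python
class TestDatasetV2API(TestAPIBase):
    def test_publish(self):
        dataset = self._make_dummy_dataset()  # hypothetical fixture
        dataset.publish()
        # Register the published entity so the suite's teardown removes
        # it from the test server instead of leaving it behind.
        TestBase._mark_entity_for_removal("data", dataset.dataset_id)
```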
tests/test_api/test_datasets.py (outdated)

```python
        self.dataset_v2 = DatasetV2API(self.v2_client, self.minio_client)
        self.dataset_fallback = FallbackProxy(self.dataset_v1, self.dataset_v2)

    @pytest.mark.uses_test_server()
```
If all the methods use `@pytest.mark.uses_test_server()`, you can simply move this decorator above the class.
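A minimal sketch of that; pytest applies a class-level mark to every test method in the class (the class and method names echo the snippets above):

```python
import pytest


@pytest.mark.uses_test_server()
class TestDatasetV2API(TestAPIBase):
    # Every test in the class now carries the mark, so the per-method
    # decorators can be dropped.
    def test_list(self): ...

    def test_delete(self): ...
```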
tests/test_datasets/test_dataset.py (outdated)

```python
        assert isinstance(xy, pd.DataFrame)
        assert xy.shape == (150, 5)

    @pytest.mark.skip("Datasets cache")
```
Remove this instead of skipping, since we are not going to use it anywhere in the future.
```python
    )
    content_xml = content_file.read_text()
    requests_mock.delete(ANY, text=content_xml, status_code=412)
```
Why not use `create_request_response`?
Metadata
Reference Issue: [ENH] V1 → V2 API Migration - datasets #1592
Depends on: [ENH] V1 → V2 API Migration - core structure #1576
Change Log Entry: This PR implements the Datasets resource and refactors its existing functions.