[ENH] V1 → V2 API Migration - datasets #1608
Conversation
Codecov Report ❌ Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1608      +/-   ##
==========================================
- Coverage   52.04%   51.47%   -0.58%
==========================================
  Files          36       63      +27
  Lines        4333     5366    +1033
==========================================
+ Hits         2255     2762     +507
- Misses       2078     2604     +526
```

☔ View full report in Codecov by Sentry.
FYI @geetu040 Currently the …

Issues: …

Example: the current

```python
def _get_dataset_features_file(did_cache_dir: str | Path | None, dataset_id: int) -> dict[int, OpenMLDataFeature]:
    ...
    return _features
```

Or by updating the Dataset class to use the underlying interface method from api_context directly:

```python
def _load_features(self) -> None:
    ...
    self._features = api_context.backend.datasets.get_features(self.dataset_id)
```

Another option is to add …
```python
def list(
    self,
    limit: int,
    offset: int,
    *,
    data_id: list[int] | None = None,  # type: ignore
    **kwargs: Any,
) -> pd.DataFrame: ...
```
Can we not have the same signature for all 3 methods: DatasetsAPI.list, DatasetsV1.list, DatasetsV2.list? Does it raise pre-commit failures, since a few parameters might not be used?
Oh, that v2 signature was experimental; I don't know how pre-commit did not catch that. Will make them the same.
Is mypy supposed to catch that?
Yes, unused parameters are caught under ruff's ARG001 rule, as seen with the cache_directory params.
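For illustration, a minimal sketch of the situation being discussed (the class, client attribute, and endpoint path are hypothetical; note ruff reports ARG001 for unused function arguments and ARG002 for unused method arguments):

```python
from typing import Any

import requests


class DatasetsV2:
    """Sketch of a v2 implementation keeping the shared signature."""

    def __init__(self, http_client: requests.Session) -> None:
        self._client = http_client

    def get_features(
        self,
        dataset_id: int,
        cache_directory: str | None = None,  # noqa: ARG002
    ) -> dict[str, Any]:
        # cache_directory is unused in v2 but kept so all implementations
        # share one signature; the noqa suppresses ruff's unused-argument
        # rule for this deliberately ignored parameter.
        return self._client.get(f"datasets/{dataset_id}/features").json()
```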
```python
def list(
    self,
    limit: int,
    offset: int,
    *,
    data_id: list[int] | None = None,  # type: ignore
    **kwargs: Any,
) -> pd.DataFrame:
```
You can make this simpler using private helper methods.
@geetu040 I could not find a good common block to extract into a helper, since the filters are passed via the URL in v1 and via a JSON object in v2, and both have different parsing. If you have any specific idea on that, please let me know.
Looking at DatasetsV1, I can think of these helper methods: `_build_list_call`, `_parse_and_validate_xml`, `_parse_dataset`.
You can do something similar for DatasetsV2, though they can be different.
> you can do something similar for DatasetsV2 though they can be different.

I see, that opens more options.
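A minimal sketch of how DatasetsV1.list could be decomposed along those lines; the three helper names come from the suggestion above, while the URL scheme, XML layout, and parsing details are assumptions:

```python
import pandas as pd
import xmltodict


class DatasetsV1:
    def __init__(self, http_client):
        self._client = http_client

    def list(self, limit: int, offset: int, **filters) -> pd.DataFrame:
        url = self._build_list_call(limit, offset, **filters)
        xml = self._client.get(url).text
        datasets = self._parse_and_validate_xml(xml)
        return pd.DataFrame([self._parse_dataset(d) for d in datasets])

    def _build_list_call(self, limit: int, offset: int, **filters) -> str:
        # v1 encodes filters as URL path segments, e.g. data/list/limit/10/offset/0
        parts = ["data/list", f"limit/{limit}", f"offset/{offset}"]
        parts += [f"{key}/{value}" for key, value in filters.items() if value is not None]
        return "/".join(parts)

    def _parse_and_validate_xml(self, xml: str) -> list[dict]:
        parsed = xmltodict.parse(xml)
        if "oml:data" not in parsed:
            raise ValueError('Error in return XML, does not contain "oml:data"')
        return parsed["oml:data"]["oml:dataset"]

    def _parse_dataset(self, raw: dict) -> dict:
        # Strip the "oml:" namespace prefix from each field.
        return {key.removeprefix("oml:"): value for key, value in raw.items()}
```

DatasetsV2 could mirror the same shape with a JSON body builder and a JSON parser instead.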
openml/_api/resources/datasets.py (outdated)

```python
        bool
            True if the deletion was successful. False otherwise.
        """
        return openml.utils._delete_entity("data", dataset_id)
```
If you implement the delete logic yourself instead of openml.utils._delete_entity, how would that look? I think it would be better.
Makes sense. It would look like a DELETE request from the client along with exception handling.
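A minimal sketch of what that could look like; the client attribute, the endpoint path, and the 412 status for a refused delete are assumptions (412 only appears later in this thread's test mocks):

```python
import requests


class DatasetsV1:
    def __init__(self, http_client):
        self._client = http_client

    def delete(self, dataset_id: int) -> bool:
        """Send a DELETE for the dataset; return True on success."""
        try:
            response = self._client.delete(f"data/{dataset_id}")
            response.raise_for_status()
        except requests.HTTPError as e:
            # Translate an expected server-side refusal into False,
            # re-raise anything unexpected.
            if e.response is not None and e.response.status_code == 412:
                return False
            raise
        return True
```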
```python
def list(
    self,
    limit: int,
    offset: int,
    **kwargs: Any,
) -> pd.DataFrame:
```
Same as above; it can use private helper methods.
openml/datasets/functions.py (outdated)

```python
    # Minimalistic check if the XML is useful
    if "oml:data_qualities_list" not in qualities:
        raise ValueError('Error in return XML, does not contain "oml:data_qualities_list"')
    from openml._api import api_context
```
Can't we have this import at the very top? Does it create a circular import error? If not, it should be moved to the top from all functions.
It does raise a circular import error.
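For context, a minimal sketch of the deferred-import pattern being discussed; the surrounding function and the backend method are hypothetical, following the snippets above:

```python
# openml/datasets/functions.py
# A module-level `from openml._api import api_context` would complete an
# import cycle (assumed here to run back through openml.datasets), so the
# import is deferred to call time inside each function that needs it.


def _get_qualities(dataset_id: int):
    from openml._api import api_context  # deferred to avoid circular import

    return api_context.backend.datasets.get_qualities(dataset_id)
```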
Thanks for the detailed explanation, I now have a good understanding of the download mechanism.

- minio can be handled easily, we will use a separate client along with …
- these are actually different objects in both APIs: v1 uses XML and v2 keeps them in JSON
- yes, you are right, they are the same files, which are not required to be downloaded again for both versions, but isn't this true for almost all the HTTP objects? they may have different formats
- I don't understand this point
- agreed, should be handled by …
- agreed, adding …

In conclusion, I may ask: if we ignore the fact that it downloads the …
@geetu040 making a new client for …

FYI the new commit adds better handling of features and qualities in the OpenMLDataset class, moving the v1-specific parsing logic to the interface. So the only part left is to handle …
From the standup discussion and earlier conversations, I think we can agree on a few points: …
Consider this a green light to experiment with the client design. Try an approach, use whatever caching strategy you think fits best, and aim for a clean, sensible design. Feel free to ask for suggestions or reviews along the way. I'll review it in code. Just make sure this doesn't significantly impact the base classes or other stacked PRs.
The points do make sense to me. I will propose the design along with how it would be used in the resource.
@geetu040 I have a design implemented which needs review.

Question: …
I have taken a quick look; the design looks really good, though I have some suggestions/questions in the code, which I will leave in a detailed review. But in general this fixes all our blockers without damaging the original design.
Is it provided by the user? I don't think so. In that case, how does it affect the users? From looking at the code, this cache directory is generated programmatically inside the functions; we can completely remove these lines and always rely on the …
This reverts commit fd43c48.
```python
def test__check_qualities():
```
Got it; since it's moved to the resource API, it should be tested at the resource class level then.
- removing this since it was not part of the sdk previously
- some tests fail because of the timeout in stacked PRs
- this option can easily be added if needed in future
openml/_api/resources/dataset.py (outdated)

```python
        original_data_url: str | None = None,
        paper_url: str | None = None,
    ) -> int:
        raise NotImplementedError(self._not_supported(method="edit"))
```
You can just use `self._not_supported(method="edit")`.
No need for `raise NotImplementedError()`.
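This implies `_not_supported` raises on its own; a minimal sketch of that pattern, with the exception type and message wording as assumptions:

```python
from typing import NoReturn


class DatasetsV2:
    def _not_supported(self, method: str) -> NoReturn:
        # Raising here lets implementations call this as a bare statement
        # instead of wrapping it in `raise NotImplementedError(...)`.
        raise NotImplementedError(f"`{method}` is not supported by this API version.")

    def edit(self, dataset_id: int) -> int:
        # NoReturn tells mypy this call never returns, so the `-> int`
        # annotation does not trigger a missing-return error.
        self._not_supported(method="edit")
```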
tests/test_api/test_datasets.py (outdated)

```python
class TestDatasetV1API(TestAPIBase):
    def setUp(self):
        super().setUp()
        self.client = self._get_http_client(
```
Since this is V1, using `self.client = self.http_client` will do.
Since the recent change, it can be used from `self.http_clients[APIVersion.V1]`.
Alright, I have made the change.
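For context, a minimal sketch of what the `self.http_clients` mapping referenced above could look like in the test base; the enum values and setup details are assumptions:

```python
from enum import Enum


class APIVersion(Enum):
    V1 = "v1"
    V2 = "v2"


class TestAPIBase:
    def setUp(self):
        # Build one HTTP client per API version up front so individual
        # test classes can grab the one they need.
        self.http_clients = {
            version: self._get_http_client(version) for version in APIVersion
        }

    def _get_http_client(self, version: APIVersion):
        ...  # construct a client pointed at the v1 or v2 endpoint
```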
geetu040 left a comment:
- can you please look at the failing test, is it related to your code? `tests/test_runs/test_run_functions.py::TestRun::test_format_prediction_non_supervised`
- maybe use the design from `tests/test_api/test_versions.py` if it helps
- your tests involve publishing something that should be deleted; for this cleanup you can use `TestBase._mark_entity_for_removal` (see the sketch below)
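A minimal sketch of that cleanup in a publishing test; the test body and the helper's exact signature are assumptions:

```python
class TestDatasetV2API(TestAPIBase):
    def test_publish(self):
        dataset = self._make_dummy_dataset()  # hypothetical fixture
        dataset.publish()
        # Register the published entity so the suite's teardown removes
        # it from the test server instead of leaving it behind.
        TestBase._mark_entity_for_removal("data", dataset.dataset_id)
```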
tests/test_api/test_datasets.py (outdated)

```python
        self.dataset_v2 = DatasetV2API(self.v2_client, self.minio_client)
        self.dataset_fallback = FallbackProxy(self.dataset_v1, self.dataset_v2)

    @pytest.mark.uses_test_server()
```
If all the methods use `@pytest.mark.uses_test_server()`, you can simply move this decorator above the class.
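A minimal sketch of that; pytest applies a class-level mark to every test method in the class (the class and method names echo the snippets above):

```python
import pytest


@pytest.mark.uses_test_server()
class TestDatasetV2API(TestAPIBase):
    # Every test in the class now carries the mark, so the per-method
    # decorators can be dropped.
    def test_list(self): ...

    def test_delete(self): ...
```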
tests/test_datasets/test_dataset.py (outdated)

```python
        assert isinstance(xy, pd.DataFrame)
        assert xy.shape == (150, 5)

    @pytest.mark.skip("Datasets cache")
```
Remove this instead of skipping, since we are not going to use it anywhere in the future.
```python
    )
    content_xml = content_file.read_text()
    requests_mock.delete(ANY, text=content_xml, status_code=412)
```
Why not use `create_request_response`?
Metadata
Reference Issue: [ENH] V1 → V2 API Migration - datasets #1592
Depends on: [ENH] V1 → V2 API Migration - core structure #1576
Change Log Entry: This PR implements the Datasets resource and refactors its existing functions.