Thanks for the pull request, @jesperhodge! Once you've gone through the following steps, feel free to tag the maintainers in a comment and let them know that your changes are ready for engineering review.

🔘 **Get product approval.** If you haven't already, check this list to see if your contribution needs to go through the product review process.

🔘 **Provide context.** To help your reviewers and other members of the community understand the purpose and larger context of your changes, add as much of the following information to the PR description as you can.

🔘 **Get a green build.** If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

**Where can I find more information?** If you'd like more details on all aspects of the review process for open source pull requests (OSPRs), check out the linked resources.

**When can I expect my changes to be merged?** Our goal is to get community contributions seen and reviewed as efficiently as possible. However, the amount of time that it takes to review and merge a PR can vary significantly based on several factors.

💡 As a result, it may take up to several weeks or months to complete a review and merge your PR.
Pull request overview
This PR fixes tag rename behavior so that updating a tag’s value with only capitalization changes (e.g., Science → science) is treated as a valid update rather than a duplicate-value error. This aligns the tagging API behavior with the Course Authoring taxonomy UI rename flow.
Changes:
- Add a regression test covering capitalization-only tag rename via the taxonomy tags update endpoint.
- Update tag value uniqueness validation to exclude the tag being updated from the “already exists” check.
Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| `tests/openedx_tagging/test_views.py` | Adds a test ensuring capitalization-only renames succeed (HTTP 200) and return the updated value. |
| `src/openedx_tagging/rest_api/v1/serializers.py` | Adjusts uniqueness validation to avoid flagging the tag itself as a duplicate during updates. |
```python
taxonomy_id = context.get("taxonomy_id")
original_tag = Tag.objects.filter(taxonomy_id=taxonomy_id, value=original_value).first() if original_value else None
tag_id = original_tag.pk if original_tag else None
if taxonomy_id is not None:
    # Check if tag value already exists within this taxonomy. If so, raise a validation error.
    queryset = Tag.objects.filter(taxonomy_id=taxonomy_id, value=value)

    # Don't compare tag against itself when validating its updated value.
    if tag_id:
        queryset = queryset.exclude(pk=tag_id)
```
`validate_tag_value` now does an extra DB query (`.first()`) to resolve the original tag before running the uniqueness check. If you want to keep renames as a single-query validation, you can either fetch only the pk via `values_list('pk', flat=True)` (and only when `taxonomy_id` is present) or fold the self-exclusion into the existing uniqueness query so it doesn't need the extra lookup.
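The folded, single-query variant can be illustrated standalone. This is only a sketch: it uses the stdlib `sqlite3` module in place of the Django ORM, and `value_taken` is a hypothetical helper (not from the openedx_tagging codebase) that does both the case-insensitive duplicate check and the self-exclusion in one query.

```python
import sqlite3

# Stand-in for the Tag table of one taxonomy; the real code uses the
# Django ORM, so this only illustrates the shape of the query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tag (pk INTEGER PRIMARY KEY, taxonomy_id INTEGER, value TEXT)")
conn.executemany(
    "INSERT INTO tag VALUES (?, ?, ?)",
    [(1, 10, "Science"), (2, 10, "History")],
)

def value_taken(conn, taxonomy_id, new_value, original_value=None):
    """Return True if another tag in the taxonomy already uses new_value.

    One query does both the case-insensitive duplicate check and the
    self-exclusion: the row currently holding original_value (the tag
    being renamed) is skipped, so a capitalization-only rename passes.
    """
    row = conn.execute(
        "SELECT 1 FROM tag"
        " WHERE taxonomy_id = ? AND value = ? COLLATE NOCASE"
        " AND (? IS NULL OR value <> ?)",
        (taxonomy_id, new_value, original_value, original_value),
    ).fetchone()
    return row is not None

# Renaming "Science" -> "science" is not a collision:
print(value_taken(conn, 10, "science", original_value="Science"))  # False
# Creating a brand-new "science" alongside "Science" is:
print(value_taken(conn, 10, "science"))  # True
```

The trade-off is the same one the comment describes: one round trip instead of a lookup plus a filtered check.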
On a related note, do we even need this parallel validation in the serializer? The API code and models are already doing the same validation elsewhere in the code, so I'm wondering if we can just catch those errors and report them correctly (as 400 errors), or if we really need this extra validation logic here to do that.
Yes, user input validation logic belongs in the view, not in the model. This validator was added in another issue because of missing validation on the model: `ValidationError`s coming from the model did not catch duplicate tag updates in the first place, which caused an `IntegrityError` from the database, something that should never happen if you have proper validations. So it was necessary to add this additional validation.

As to where to put the validation: adding more validations to the model, while still letting invalid requests reach the model and run whatever other code is part of instantiating it, invites security risks. Having user input validation in the model is in general already a red flag to me.

And there's no such thing as too much validation. If you want additional validations in the model, that is fine, but it does not replace the need for (possibly the same) validations in the view/serializer.
> The API code and models are already doing the same validation elsewhere in the code
No they are not: #536
> `validate_tag_value` now does an extra DB query (`.first()`) to resolve the original tag before running the uniqueness check. If you want to keep renames as a single-query validation, you can either fetch only the pk via `values_list('pk', flat=True)` (and only when `taxonomy_id` is present) or fold the self-exclusion into the existing uniqueness query so it doesn't need the extra lookup.
@bradenmacdonald @mgwozdz-unicon Regarding copilot's review:
An extra DB query is intended and fine, and makes no significant difference performance-wise.
> allowing requests to pass through to the model even though they are not considered valid, and then running any other code that might be part of instantiating the model, invites security risks.
Sure, in a theoretical sense, but this is not sensitive code. And generally this is the way Django is designed to work (user input validation explicitly involves model validation):
- https://docs.djangoproject.com/en/6.0/ref/models/instances/#validating-objects
- https://docs.djangoproject.com/en/6.0/topics/forms/modelforms/#validation-on-a-modelform
> Having user input validations in the model is in general already a red flag to me.
I understand that, but I tend to approach this from the opposite perspective. Data flows via REST API -> Python API -> ORM model -> database. But some operations work directly on the database, others use the ORM models directly, others use the Python API, and finally, many requests go through the REST API. I tend to keep all validation as "low level" as possible (i.e. in the database itself) because then it applies to 100% of changes. The higher level your validation is (e.g. in the serializer), the higher the chance of someone bypassing it and getting invalid data into your system.

That said, you're absolutely right that there's nothing wrong with too much validation; but in the spirit of keeping things consistent, any high-level validation should try to share code definitions like `RESERVED_TAG_CHARS` etc. with the low-level validations, so we're sure the same rules are being applied consistently.

Django and DRF both try to support this approach: `ModelForm` generates validators automatically from the model validators and also calls the model validation explicitly, and DRF's `ModelSerializer` generates fields (and validators) from your model fields/validators automatically. (Although they do recommend replacing them with explicit serializers that you write manually.)
Feel free to leave your validation code in, especially the part that uses `RESERVED_TAG_CHARS`, but I personally don't think we need the check for duplicate values, as it's easier to just call `full_clean()` and use Django's check. I guess I'm OK with either approach though.
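The "just call `full_clean()` and report its errors as a 400" idea can be sketched without a full Django setup. Everything here is a hypothetical stand-in: `ModelValidationError` plays the role of `django.core.exceptions.ValidationError` raised by `full_clean()`, and `update_tag_view` plays the role of the DRF view; the duplicate rule itself is assumed to live at the model level.

```python
class ModelValidationError(Exception):
    """Stand-in for django.core.exceptions.ValidationError."""
    def __init__(self, messages):
        super().__init__(messages)
        self.messages = messages

def full_clean(new_value, other_values):
    """Stand-in for the model's full_clean(): reject duplicate values
    (case-insensitively, per the rule discussed in this thread)."""
    if new_value.lower() in {v.lower() for v in other_values}:
        raise ModelValidationError([f"Tag value '{new_value}' already exists."])

def update_tag_view(new_value, other_values):
    """Run model validation and translate failures into a 400 response,
    instead of letting them surface later as a 500 (e.g. an IntegrityError)."""
    try:
        full_clean(new_value, other_values)
    except ModelValidationError as exc:
        return 400, {"detail": exc.messages}
    return 200, {"value": new_value}

status, body = update_tag_view("science", other_values=["History"])
print(status)  # 200
status, body = update_tag_view("science", other_values=["Science"])
print(status)  # 400
```

The point of the pattern is the error translation, not the rule itself: the single source of truth for the rule stays at the model layer, and the view only decides the HTTP status.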
Okay. Since this ticket is just a fix for the validation code and not an addition of validation code, removing it here would be out of scope anyway.
Some comments for posterity:
Any API endpoint is sensitive code and needs to be guarded against security breaches.
In terms of relying on the perspective to keep validation as "low level" as possible, I don't see that as acceptable. This is against security best practices. As stated in the OWASP documentation, "Input validation should happen as early as possible in the data flow." (https://cheatsheetseries.owasp.org/cheatsheets/Input_Validation_Cheat_Sheet.html)
As for Django model serialization and validation, they are an additional layer of possible validation and security, but not a replacement. The standard in the DRF world is also to use the validation methods provided by the View Serializers.
Haha, well, they are supposed to be. Let's try to keep the code DRY and not have two totally separate validation code paths. Please see efa9128 for how to fix the existing validation so it runs everywhere, and results in 400 errors as appropriate in the API without the need to add any new redundant validation at the serializer level.
As I have outlined above, this is against security best practices. If you want to consolidate validation, you can think about extending DRF's View Serializer validations, but I would prefer not to remove any validations from the View level. That is setting a bad pattern inviting security flaws.
If you like to keep things DRY:
What do you think about extracting the core logic (like `RESERVED_TAG_CHARS` or the regex patterns) into a shared utility? That way, both the serializer and the model use the exact same rule, but execute it at their respective, appropriate layers.
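A minimal sketch of that extraction might look like the following. Note the assumptions: the contents of `RESERVED_TAG_CHARS` shown here are made up for illustration (the real set lives in openedx_tagging and may differ), and `check_tag_value` is a hypothetical helper, not an existing function.

```python
# One shared rule definition, callable from both the model's clean()
# and the serializer's validate_value(), so the layers cannot drift.
RESERVED_TAG_CHARS = frozenset({";", ":", "|"})  # assumed contents, for illustration

def check_tag_value(value):
    """Return a list of rule violations; an empty list means the value is valid."""
    errors = []
    if not value.strip():
        errors.append("Tag value cannot be blank.")
    bad = sorted(c for c in RESERVED_TAG_CHARS if c in value)
    if bad:
        errors.append(f"Tag value contains reserved characters: {''.join(bad)}")
    return errors

print(check_tag_value("Science"))       # []
print(len(check_tag_value("A|B")))      # 1
```

The model's `clean()` and the serializer's `validate_value()` would each call `check_tag_value` and raise their own layer's exception type from the returned messages.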
`RESERVED_TAG_CHARS` is already being shared among the layers, and yeah, there is lots of other cleanup and validation (like trimming whitespace) that could be moved into a shared utility, but it's way out of scope for this PR. Let's just focus on solving the bug.
> I would prefer not to remove any validations from the View level
It was only added last week, and it introduced a bug because it does the validation wrong, whereas the existing validation code was already correct but was simply not being reported correctly as 400 error rather than a 500 error.
This is really not a security-sensitive context*, so my preference is to focus on implementing it correctly, and avoiding duplicate logic is one way to do that. However, your perspective is completely valid. I'm fine with moving forward with your approach as long as you update the documentation of your validator to clarify that we're trying to validate early and replicate the model validation which will still run later. I don't want developers being confused about what validation happens where nor which is the "true" validation and which is extra input validation.
* Tags are not used for access control, taxonomies are typically scoped to a single organization, these are simple string values, and there's no XSS/code/SQL/HTML injection risk.
Just so I'm clear: is the idea that we don't allow tags that differ just by case, i.e. I couldn't have both "Science" and "science" at the same time, but we do want to allow changing case, so "science" can be renamed to "Science"?
Hi @ormsbee,

That is correct, at least according to Jenna's remarks in UAT. But we should not check that the new value of tag A is different from the old value of A: that leads to unexpected validation errors, such as the case here where someone renames a tag and gets an error. Regardless of whether they should get an error at all (according to UAT and common sense, no), throwing an error that says the new tag value is not unique, when in fact it is unique and merely similar to the tag's old value, would be incorrect.
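The agreed-on semantics can be pinned down in a small ORM-free sketch. `is_duplicate` and the in-memory `tags` dict are illustrative stand-ins, with the case-insensitive duplicate rule taken from the discussion above.

```python
def is_duplicate(tags, new_value, updating_pk=None):
    """True if new_value collides, case-insensitively, with ANOTHER tag.

    tags maps pk -> value for one taxonomy. The tag being renamed
    (updating_pk) is skipped, so "Science" -> "science" is a valid
    rename even though the two values compare equal ignoring case.
    """
    for pk, value in tags.items():
        if pk == updating_pk:
            continue  # never compare a tag against itself
        if value.lower() == new_value.lower():
            return True
    return False

tags = {1: "Science", 2: "History"}
print(is_duplicate(tags, "science", updating_pk=1))  # False: rename allowed
print(is_duplicate(tags, "science"))                 # True: would clash with "Science"
```

This captures both halves of the question: two tags differing only by case cannot coexist, but a single tag may change its own capitalization.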
Fixes openedx/modular-learning#274
Testing Instructions