Yesterday
I took a first swing at this, copying over the Dockerfile instructions to the hf blubber image.
At the moment this is failing on install:

18.87   × python setup.py egg_info did not run successfully.
18.87   │ exit code: 1
18.87   ╰─> [9 lines of output]
18.87       Traceback (most recent call last):
18.87         File "<string>", line 2, in <module>
18.87         File "<pip-setuptools-caller>", line 34, in <module>
18.87         File "/srv/app/flash-attention-v2/setup.py", line 21, in <module>
18.87           import torch
18.87         File "/opt/lib/python/site-packages/torch/__init__.py", line 237, in <module>
18.87           from torch._C import *  # noqa: F403
18.87           ^^^^^^^^^^^^^^^^^^^^^^
18.87       ImportError: libamdhip64.so: cannot enable executable stack as shared object requires: Invalid argument
18.87       [end of output]
18.87
18.87   note: This error originates from a subprocess, and is likely not a problem with pip.
19.10 error: metadata-generation-failed
19.10
19.10 × Encountered error while generating package metadata.
19.10 ╰─> See above for output.
19.10
19.10 note: This is an issue with the package mentioned above, not pip.
19.10 hint: See above for details.
------
ERROR: failed to solve: process "/bin/sh -c python3 \"-m\" \"pip\" \"install\" \"-r\" \"huggingface_modelserver/requirements.txt\"" did not complete successfully: exit code: 1
I need to recheck whether this is a permissions issue and, if so, whether it would make more sense to install flash attention in the pytorch base image in production-images instead of the inference-services repository.
I've deployed the new model in the experimental namespace in ml-staging so it is now available for further testing.
I've uploaded the model on swift and in the public analytics space
Tue, Jul 30
@Isaac We're going to solve the numpy issue by relaxing the kserve restriction by using our wmf kserve fork. At some point in the near future it is going to be supported anyway so we will switch to the official release then. It wouldn't make much sense to build things with an older version just to make things work. Thanks for offering to help!
Update: I'm having some issues while building the Lift Wing service, caused by dependencies.
I'm getting an issue on model load caused by numpy: kserve demands numpy <2.0.0, while locally I've had no issue running things in a notebook with numpy 2.0.0. On model load I'm getting this error:

Traceback (most recent call last):
  File "/srv/articlequality/model_server/model.py", line 110, in <module>
    model = ArticleQualityModel(
            ^^^^^^^^^^^^^^^^^^^^
  File "/srv/articlequality/model_server/model.py", line 50, in __init__
    self.load()
  File "/srv/articlequality/model_server/model.py", line 53, in load
    self.model = load_pickle(self.model_path)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/somebody/.local/lib/python3.11/site-packages/statsmodels/iolib/smpickle.py", line 42, in load_pickle
    return pickle.load(fin)
           ^^^^^^^^^^^^^^^^
  File "/home/somebody/.local/lib/python3.11/site-packages/numpy/random/_pickle.py", line 34, in __bit_generator_ctor
    raise ValueError(str(bit_generator_name) + ' is not a known '
ValueError: <class 'numpy.random._mt19937.MT19937'> is not a known BitGenerator module.
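If I understand the failure correctly, it is reproducible outside our stack: numpy 2.x pickles a BitGenerator by class object, while the numpy <2.0 unpickler only knows name strings, so anything pickled under 2.x (like our model) fails to load under the pinned version. An illustrative sketch (not our model code):

import pickle

import numpy as np

# Under numpy >= 2.0: a legacy RandomState (MT19937-backed), e.g. inside a
# statsmodels results object, pickles a reference to its BitGenerator class.
blob = pickle.dumps(np.random.RandomState())

# Under numpy < 2.0, loading those same bytes raises:
#   ValueError: <class 'numpy.random._mt19937.MT19937'> is not a known BitGenerator module.
state = pickle.loads(blob)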
I'll work on this and provide an update.
Mon, Jul 29
Until now, the only solution I have found involves making 2 requests instead of 1 (with examples using the API sandbox):
Fri, Jul 26
@mfossati as discussed, here are the links to the MediaWiki code in the ORES extension that makes requests to Lift Wing:
It seems that the increased memory usage described above is standard, expected behavior.
I redeployed the service and it was using 18GB of VRAM. Shortly after I ran a load test, usage went up to 46GB again (grafana link).
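My working theory (an assumption about the serving stack, not something I've verified in its code) is that this is PyTorch's caching allocator at work: blocks freed after the load test stay reserved by the process, so the driver, and hence grafana, keeps reporting the high-water mark. The distinction is visible via the torch API:

import torch

# memory_allocated: tensors currently alive.
# memory_reserved: blocks the caching allocator keeps around for reuse; this is
# what the driver reports as "used" VRAM even after allocations are freed.
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())

# Hands unused cached blocks back to the driver (lowers reported usage,
# at the cost of re-allocating them later).
torch.cuda.empty_cache()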
I ran a load test with the following setup:
duration: 10 minutes
users: 2
output_size(max_tokens): 10-200
prompt_input_size (# words): 15-300
MODEL=huggingface locust
[2024-07-26 07:45:58,337] stat1008/INFO/locust.main: Run time limit set to 600 seconds
[2024-07-26 07:45:58,337] stat1008/INFO/locust.main: Starting Locust 2.29.1
[2024-07-26 07:45:58,338] stat1008/INFO/locust.runners: Ramping to 2 users at a rate of 10.00 per second
[2024-07-26 07:45:58,338] stat1008/INFO/locust.runners: All users spawned: {"HuggingfaceServer": 2} (2 total users)
[2024-07-26 07:55:57,838] stat1008/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2024-07-26 07:55:57,923] stat1008/INFO/locust.main: Shutting down (exit code 0)
Type     Name                        # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|---------------------------|--------|-----------|--------|--------|--------|--------|---------|-----------
POST     /openai/v1/completions         71     0(0.00%) |  13810    2915   21509  14000 |    0.12        0.00
--------|---------------------------|--------|-----------|--------|--------|--------|--------|---------|-----------
         Aggregated                      71     0(0.00%) |  13810    2915   21509  14000 |    0.12        0.00
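For reference, a minimal locustfile sketch matching the setup above (the payload fields, prompt generator, and model id are assumptions for illustration; the actual test code lives in our repo):

import random

from locust import HttpUser, task


class HuggingfaceServer(HttpUser):
    @task
    def completions(self):
        # Hypothetical prompt generator: 15-300 words per request.
        prompt = " ".join(["test"] * random.randint(15, 300))
        self.client.post(
            "/openai/v1/completions",
            json={
                "model": "gemma2-9b-it",  # assumed model id served by the endpoint
                "prompt": prompt,
                "max_tokens": random.randint(10, 200),
            },
        )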
Thu, Jul 25
I've managed to run the locust tests from stat1008 for gemma2-9b-it using the following process:
Wed, Jul 24
With the merged patch we would have the below change in deployment charts:
command: [ "./entrypoint.sh"] args: ["--dtype", "float32"]
I'll have an update on this next week, since this week the team is doing a focus week on LLM work. I've already done some work in the patch seen above.
Tue, Jul 23
Mon, Jul 22
Let's keep this Open until we deploy the new version with the fix to production
Fri, Jul 19
The service is up and running in staging and works as expected.
The production service is still running an older image that works fine, and since this is not an urgent thing to deploy (nor a fix), we'll deploy and test production after the following week, as the ML team is doing a focus week on LLMs.
The issue was caused by the cmd args passed to the container not being forwarded by the model_server_entrypoint.sh script, which is the entrypoint for the container.
This means that the command that is run is:
python3 transformers/transformers.py
instead of
python3 transformers/transformers.py --model_name outlink-topic-model --predictor_host outlink-topic-model-predictor-default.articletopic-outlink --http_port 8080
Since the first argument to the script is the Python script to execute, we forward the additional arguments from the second one onward as follows:
exec /usr/bin/python3 ${MODEL_SERVER_PATH} "${@:2}"
Thu, Jul 18
This service hasn't been deployed in quite a while (last deployed change on 14/12/2023), so some changes made since then have been causing errors.
It is puzzling that --predictor_host and the model name aren't set anywhere in our code. However, when inspecting the deployed transformer pod (kubectl describe pod xxxx) we see the following:
Containers:
  kserve-container:
    Image:      docker-registry.discovery.wmnet/wikimedia/machinelearning-liftwing-inference-services-outlink-transformer:2023-12-14-124100-publish
    Port:       8080/TCP
    Host Port:  0/TCP
    Args:
      --model_name
      outlink-topic-model
      --predictor_host
      outlink-topic-model-predictor-default.articletopic-outlink
      --http_port
      8080
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:     1
      memory:  2Gi
    Environment:
      WIKI_URL:         http://mw-api-int-ro.discovery.wmnet:4680
      PORT:             8080
      K_REVISION:       outlink-topic-model-transformer-default-00020
      K_CONFIGURATION:  outlink-topic-model-transformer-default
      K_SERVICE:        outlink-topic-model-transformer-default
We need to understand:
- how the model_name and predictor_host are set in the args
- how these args are used by the transformer: even though they are set, they are not parsed by kserve code, and our code doesn't explicitly set these args on the cmd, since it calls python3 transformer/transformer.py as the entrypoint. Looking at the git history, it has been like this all along. (A sketch of the standard kserve parsing pattern follows below.)
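For reference, the standard kserve transformer pattern from the upstream examples (an assumption about what our transformer.py should look like, not a copy of it); note these flags only ever reach the process if the container entrypoint forwards them:

import argparse

from kserve import model_server

# Extend kserve's own CLI parser with the transformer-specific flags.
parser = argparse.ArgumentParser(parents=[model_server.parser])
parser.add_argument("--predictor_host", help="host of the predictor service")
parser.add_argument("--model_name", default="outlink-topic-model")
args, _ = parser.parse_known_args()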
Thanks for the update Isaac!
Looking at the above code and model, if I understand correctly the following changes need to be introduced in Lift Wing (a sketch follows the list):
- switch from sklearn to statsmodels ordinal regression
- change output schema to match the one on the model card
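A rough sketch of what that change could look like (the pickle path, class labels, and output field names are illustrative assumptions; the real schema comes from the model card):

from statsmodels.iolib.smpickle import load_pickle

# Load the pickled statsmodels ordinal regression results (path illustrative).
model = load_pickle("/mnt/models/model.pkl")

def predict(features):
    # An ordinal regression model's predict() returns one probability per
    # ordered quality class for each input row.
    probs = model.predict(features)[0]
    labels = ["Stub", "Start", "C", "B", "GA", "FA"]  # assumed class labels
    return {
        "prediction": labels[int(probs.argmax())],
        "probabilities": {label: float(p) for label, p in zip(labels, probs)},
    }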
Wed, Jul 17
In the past we have used the envoy proxy to access Lift Wing from mw. Here is the relevant piece of config used by the Ores Extension.
Here is the relevant piece of config in puppet that configures Lift Wing production (the service is named inference) and the one for ml-staging (named inference-staging).
Tue, Jul 16
The current work can be marked done, as we can now deploy images using the huggingfaceserver in a stable way after completing https://phabricator.wikimedia.org/T369359
The current task can be marked done: after investigation, vllm seems to be the most prominent solution for an inference optimization engine, and work continues in https://phabricator.wikimedia.org/T370149
The current work can be marked done as we can now deploy images using the huggingfaceserver.
Mon, Jul 15
Resolving this as the previous issue that occurred during a deployment (https://phabricator.wikimedia.org/T369359#9974140) doesn't have anything to do with this task.
I re-deployed the 27b model today and it is running fine:
Fri, Jul 12
Resolving this as it can't be reproduced.
I see you've done a lot of great work on feature engineering and preprocessing, so I don't mean to interfere with your work! My suggestion is a bit short-sighted, as I was looking at it from the perspective of deploying and updating a model. I was hoping to use a gradient boosting model and not do any normalization (we'd still have to take care of extreme outliers). This way we wouldn't have to maintain a separate csv with the values used in preprocessing, and we could still have interpretable features using the feature importance attribute of these models.
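To make the suggestion concrete, a toy sketch (made-up feature names and data, not the actual pipeline): tree-based boosting is invariant to monotonic feature scaling, so raw values go straight in, and interpretability comes from the importances:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
feature_names = ["num_refs", "page_length", "edit_rate"]  # hypothetical features
X = rng.normal(size=(200, 3)) * [1.0, 1000.0, 0.01]  # wildly different scales, unnormalized
y = (X[:, 1] > 0).astype(int)  # toy target

# No scaler, and no separate CSV of normalization constants to ship with the model.
model = GradientBoostingClassifier().fit(X, y)
print(dict(zip(feature_names, model.feature_importances_)))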
I've run all tests for staging and prod. Staging is fine, but I get this error on prod eqiad, which seems transient:
httpbb --host inference.svc.eqiad.wmnet --https_port 30443 /srv/deployment/httpbb-tests/liftwing/production/*
Sending to inference.svc.eqiad.wmnet...
https://nlwiki-articlequality.revscoring-articlequality.wikimedia.org/v1/models/nlwiki-articlequality:predict (/srv/deployment/httpbb-tests/liftwing/production/test_revscoring-articlequality.yaml:38)
    ERROR: HTTPSConnectionPool(host='inference.svc.eqiad.wmnet', port=30443): Read timed out. (read timeout=10)
=== ERRORS: 114 requests attempted to inference.svc.eqiad.wmnet. Errors connecting to 1 host.
Thu, Jul 11
Following up on some of the above:
Tested the updated image in ml-staging using the GPU and got the following error:
2024-07-11 11:00:24.531 1 kserve INFO [storage.py:download():66] Copying contents of /mnt/models to local
2024-07-11 11:00:24.531 1 kserve INFO [storage.py:download():110] Successfully copied /mnt/models to None
2024-07-11 11:00:24.531 1 kserve INFO [storage.py:download():111] Model downloaded in 0.0003374351654201746 seconds.
2024-07-11 11:00:24.532 1 kserve INFO [__main__.py:load_model():204] Loading generative model for task 'text_generation' in torch.bfloat16
2024-07-11 11:00:24.755 1 kserve INFO [generative_model.py:load():206] Decoder-only model detected. Setting padding side to left.
2024-07-11 11:00:25.362 1 kserve INFO [generative_model.py:load():223] Successfully loaded tokenizer
Loading checkpoint shards: 100%|██████████| 12/12 [00:18<00:00, 1.51s/it]
WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
WARNING:accelerate.big_modeling:You shouldn't move a model that is dispatched using accelerate hooks.
2024-07-11 11:00:44.203 1 kserve ERROR [__main__.py:<module>():259] Failed to start model server: You can't move a model that has some modules offloaded to cpu or disk
Building the same image locally on an M1 doesn't cause this issue, which is weird, but it is likely caused by one of the dependencies having a different version on the M1.
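For the error itself, my reading (an assumption to verify against the server code) is the standard accelerate failure mode: with device_map="auto", layers that don't fit in VRAM get offloaded to CPU (hence the meta-device warning in the log), and any later attempt to move the whole model raises exactly this ValueError. Roughly:

from transformers import AutoModelForCausalLM

# With device_map="auto", accelerate dispatches layers across GPU and CPU,
# offloading whatever doesn't fit in VRAM (triggering the warnings above).
model = AutoModelForCausalLM.from_pretrained("/mnt/models", device_map="auto")

# Calling .to() on a dispatched model then fails with:
#   ValueError: You can't move a model that has some modules offloaded to cpu or disk
model.to("cuda")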