Transformers cli clean command (#37657)

* transformers-cli -> transformers

* Chat command works with positional argument

* update doc references to transformers-cli

* doc headers

* deepspeed

---------

Co-authored-by: Joao Gante <joao@huggingface.co>
This commit is contained in:
Lysandre Debut
2025-04-30 13:15:43 +02:00
committed by GitHub
parent 63cd4c76f3
commit d538293f62
49 changed files with 399 additions and 386 deletions

View File

@@ -161,7 +161,7 @@ The downside is that if you aren't used to them, it may take some time to get us
Run the command below to start and complete the questionnaire with some basic information about the new model. This command jumpstarts the process by automatically generating some model code that you'll need to adapt.
```bash
transformers-cli add-new-model-like
transformers add-new-model-like
```
## Create a pull request
@@ -292,7 +292,7 @@ Once you're able to run the original checkpoint, you're ready to start adapting
## Adapt the model code
The `transformers-cli add-new-model-like` command should have generated a model and configuration file.
The `transformers add-new-model-like` command should have generated a model and configuration file.
- `src/transformers/models/brand_new_llama/modeling_brand_new_llama.py`
- `src/transformers/models/brand_new_llama/configuration_brand_new_llama.py`
@@ -551,10 +551,10 @@ While this example doesn't include an image processor, you may need to implement
If you do need to implement a new image processor, refer to an existing image processor to understand the expected structure. Slow image processors ([`BaseImageProcessor`]) and fast image processors ([`BaseImageProcessorFast`]) are designed differently, so make sure you follow the correct structure based on the processor type you're implementing.
Run the following command (only if you haven't already created the fast image processor with the `transformers-cli add-new-model-like` command) to generate the necessary imports and to create a prefilled template for the fast image processor. Modify the template to fit your model.
Run the following command (only if you haven't already created the fast image processor with the `transformers add-new-model-like` command) to generate the necessary imports and to create a prefilled template for the fast image processor. Modify the template to fit your model.
```bash
transformers-cli add-fast-image-processor --model-name your_model_name
transformers add-fast-image-processor --model-name your_model_name
```
This command will generate the necessary imports and provide a pre-filled template for the fast image processor. You can then modify it to fit your model's needs.

View File

@@ -25,12 +25,12 @@ Check model leaderboards like [OpenLLM](https://hf.co/spaces/HuggingFaceH4/open_
This guide shows you how to quickly start chatting with Transformers from the command line, how build and format a conversation, and how to chat using the [`TextGenerationPipeline`].
## transformers-cli
## transformers CLI
Chat with a model directly from the command line as shown below. It launches an interactive session with a model. Enter `clear` to reset the conversation, `exit` to terminate the session, and `help` to display all the command options.
```bash
transformers-cli chat --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct
transformers chat Qwen/Qwen2.5-0.5B-Instruct
```
<div class="flex justify-center">
@@ -40,7 +40,7 @@ transformers-cli chat --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct
For a full list of options, run the command below.
```bash
transformers-cli chat -h
transformers chat -h
```
The chat is implemented on top of the [AutoClass](./model_doc/auto), using tooling from [text generation](./llm_tutorial) and [chat](./chat_templating).
@@ -76,16 +76,16 @@ print(response[0]["generated_text"][-1]["content"])
(sigh) Oh boy, you're asking me for advice? You're gonna need a map, pal! Alright,
alright, I'll give you the lowdown. But don't say I didn't warn you, I'm a robot, not a tour guide!
So, you wanna know what's fun to do in the Big Apple? Well, let me tell you, there's a million
things to do, but I'll give you the highlights. First off, you gotta see the sights: the Statue of
Liberty, Central Park, Times Square... you know, the usual tourist traps. But if you're lookin' for
something a little more... unusual, I'd recommend checkin' out the Museum of Modern Art. It's got
So, you wanna know what's fun to do in the Big Apple? Well, let me tell you, there's a million
things to do, but I'll give you the highlights. First off, you gotta see the sights: the Statue of
Liberty, Central Park, Times Square... you know, the usual tourist traps. But if you're lookin' for
something a little more... unusual, I'd recommend checkin' out the Museum of Modern Art. It's got
some wild stuff, like that Warhol guy's soup cans and all that jazz.
And if you're feelin' adventurous, take a walk across the Brooklyn Bridge. Just watch out for
And if you're feelin' adventurous, take a walk across the Brooklyn Bridge. Just watch out for
those pesky pigeons, they're like little feathered thieves! (laughs) Get it? Thieves? Ah, never mind.
Now, if you're lookin' for some serious fun, hit up the comedy clubs in Greenwich Village. You might
Now, if you're lookin' for some serious fun, hit up the comedy clubs in Greenwich Village. You might
even catch a glimpse of some up-and-coming comedians... or a bunch of wannabes tryin' to make it big. (winks)
And finally, if you're feelin' like a real New Yorker, grab a slice of pizza from one of the many amazing
@@ -107,9 +107,9 @@ print(response[0]["generated_text"][-1]["content"])
```
```txt
(laughs) Oh, you're killin' me, pal! You don't get it, do you? Warhol's soup cans are like, art, man!
It's like, he took something totally mundane, like a can of soup, and turned it into a masterpiece. It's
like, "Hey, look at me, I'm a can of soup, but I'm also a work of art!"
(laughs) Oh, you're killin' me, pal! You don't get it, do you? Warhol's soup cans are like, art, man!
It's like, he took something totally mundane, like a can of soup, and turned it into a masterpiece. It's
like, "Hey, look at me, I'm a can of soup, but I'm also a work of art!"
(sarcastically) Oh, yeah, real original, Andy.
But, you know, back in the '60s, it was like, a big deal. People were all about challenging the

View File

@@ -81,10 +81,10 @@ print(f"The predicted token is: {predicted_token}")
```
</hfoption>
<hfoption id="transformers-cli">
<hfoption id="transformers CLI">
```bash
echo -e "Plants create [MASK] through a process known as photosynthesis." | transformers-cli run --task fill-mask --model google-bert/bert-base-uncased --device 0
echo -e "Plants create [MASK] through a process known as photosynthesis." | transformers run --task fill-mask --model google-bert/bert-base-uncased --device 0
```
</hfoption>
@@ -256,4 +256,4 @@ echo -e "Plants create [MASK] through a process known as photosynthesis." | tran
[[autodoc]] models.bert.modeling_tf_bert.TFBertForPreTrainingOutput
[[autodoc]] models.bert.modeling_flax_bert.FlaxBertForPreTrainingOutput
[[autodoc]] models.bert.modeling_flax_bert.FlaxBertForPreTrainingOutput

View File

@@ -35,7 +35,7 @@ The example below demonstrates how to generate code with [`Pipeline`], or the [`
<hfoptions id="usage">
<hfoption id="Pipeline">
```py
import torch
from transformers import pipeline
@@ -76,7 +76,7 @@ prompt = "# Function to calculate the factorial of a number\ndef factorial(n):"
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(
**input_ids,
**input_ids,
max_new_tokens=256,
cache_implementation="static"
)
@@ -92,10 +92,10 @@ print(filled_text)
```
</hfoption>
<hfoption id="transformers-cli">
<hfoption id="transformers CLI">
```bash
echo -e "# Function to calculate the factorial of a number\ndef factorial(n):" | transformers-cli run --task text-generation --model meta-llama/CodeLlama-7b-hf --device 0
echo -e "# Function to calculate the factorial of a number\ndef factorial(n):" | transformers run --task text-generation --model meta-llama/CodeLlama-7b-hf --device 0
```
</hfoption>
@@ -146,7 +146,7 @@ visualizer("""def func(a, b):
- Use the `<FILL_ME>` token where you want your input to be filled. The tokenizer splits this token to create a formatted input string that follows the [original training pattern](https://github.com/facebookresearch/codellama/blob/cb51c14ec761370ba2e2bc351374a79265d0465e/llama/generation.py#L402). This is more robust than preparing the pattern yourself.
```py
from transformers import LlamaForCausalLM, CodeLlamaTokenizer
tokenizer = CodeLlamaTokenizer.from_pretrained("meta-llama/CodeLlama-7b-hf")
model = LlamaForCausalLM.from_pretrained("meta-llama/CodeLlama-7b-hf")
PROMPT = '''def remove_non_ascii(s: str) -> str:
@@ -155,7 +155,7 @@ visualizer("""def func(a, b):
'''
input_ids = tokenizer(PROMPT, return_tensors="pt")["input_ids"]
generated_ids = model.generate(input_ids, max_new_tokens=128)
filling = tokenizer.batch_decode(generated_ids[:, input_ids.shape[1]:], skip_special_tokens = True)[0]
print(PROMPT.replace("<FILL_ME>", filling))
```

View File

@@ -49,9 +49,9 @@ model = AutoModelForCausalLM.from_pretrained("CohereForAI/c4ai-command-r-v01", t
messages = [{"role": "user", "content": "How do plants make energy?"}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
output = model.generate(
input_ids,
max_new_tokens=100,
do_sample=True,
input_ids,
max_new_tokens=100,
do_sample=True,
temperature=0.3,
cache_implementation="static",
)
@@ -59,11 +59,11 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
```
</hfoption>
<hfoption id="transformers-cli">
<hfoption id="transformers CLI">
```bash
# pip install -U flash-attn --no-build-isolation
transformers-cli chat --model_name_or_path CohereForAI/c4ai-command-r-v01 --torch_dtype auto --attn_implementation flash_attention_2
transformers chat CohereForAI/c4ai-command-r-v01 --torch_dtype auto --attn_implementation flash_attention_2
```
</hfoption>
@@ -85,9 +85,9 @@ model = AutoModelForCausalLM.from_pretrained("CohereForAI/c4ai-command-r-v01", t
messages = [{"role": "user", "content": "How do plants make energy?"}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
output = model.generate(
input_ids,
max_new_tokens=100,
do_sample=True,
input_ids,
max_new_tokens=100,
do_sample=True,
temperature=0.3,
cache_implementation="static",
)

View File

@@ -83,10 +83,10 @@ print(f"Predicted label: {predicted_label}")
</hfoption>
<hfoption id="transformers-cli">
<hfoption id="transformers CLI">
```bash
echo -e "I love using Hugging Face Transformers!" | transformers-cli run --task text-classification --model distilbert-base-uncased-finetuned-sst-2-english
echo -e "I love using Hugging Face Transformers!" | transformers run --task text-classification --model distilbert-base-uncased-finetuned-sst-2-english
```
</hfoption>
@@ -213,7 +213,3 @@ echo -e "I love using Hugging Face Transformers!" | transformers-cli run --task
</jax>
</frameworkcontent>

View File

@@ -45,9 +45,9 @@ import torch
from transformers import pipeline
classifier = pipeline(
task="text-classification",
model="bhadresh-savani/electra-base-emotion",
torch_dtype=torch.float16,
task="text-classification",
model="bhadresh-savani/electra-base-emotion",
torch_dtype=torch.float16,
device=0
)
classifier("This restaurant has amazing food!")
@@ -64,7 +64,7 @@ tokenizer = AutoTokenizer.from_pretrained(
"bhadresh-savani/electra-base-emotion",
)
model = AutoModelForSequenceClassification.from_pretrained(
"bhadresh-savani/electra-base-emotion",
"bhadresh-savani/electra-base-emotion",
torch_dtype=torch.float16
)
inputs = tokenizer("ELECTRA is more efficient than BERT", return_tensors="pt")
@@ -78,10 +78,10 @@ print(f"Predicted label: {predicted_label}")
```
</hfoption>
<hfoption id="transformers-cli">
<hfoption id="transformers CLI">
```bash
echo -e "This restaurant has amazing food." | transformers-cli run --task text-classification --model bhadresh-savani/electra-base-emotion --device 0
echo -e "This restaurant has amazing food." | transformers run --task text-classification --model bhadresh-savani/electra-base-emotion --device 0
```
</hfoption>
@@ -96,12 +96,12 @@ echo -e "This restaurant has amazing food." | transformers-cli run --task text-c
```py
# Example of properly handling padding with attention masks
inputs = tokenizer(["Short text", "This is a much longer text that needs padding"],
padding=True,
inputs = tokenizer(["Short text", "This is a much longer text that needs padding"],
padding=True,
return_tensors="pt")
outputs = model(**inputs) # automatically uses the attention_mask
```
- When using the discriminator for a downstream task, you can load it into any of the ELECTRA model classes ([`ElectraForSequenceClassification`], [`ElectraForTokenClassification`], etc.).
## ElectraConfig

View File

@@ -41,7 +41,7 @@ import torch
from transformers import pipeline
pipeline = pipeline(
task="text-generation",
task="text-generation",
model="tiiuae/falcon-7b-instruct",
torch_dtype=torch.bfloat16,
device=0
@@ -76,11 +76,11 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
```
</hfoption>
<hfoption id="transformers-cli">
<hfoption id="transformers CLI">
```bash
# pip install -U flash-attn --no-build-isolation
transformers-cli chat --model_name_or_path tiiuae/falcon-7b-instruct --torch_dtype auto --attn_implementation flash_attention_2 --device 0
transformers chat tiiuae/falcon-7b-instruct --torch_dtype auto --attn_implementation flash_attention_2 --device 0
```
</hfoption>
@@ -150,4 +150,4 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
## FalconForQuestionAnswering
[[autodoc]] FalconForQuestionAnswering
- forward
- forward

View File

@@ -39,7 +39,7 @@ import torch
from transformers import pipeline
pipeline = pipeline(
"text-generation",
"text-generation",
model="tiiuae/falcon-mamba-7b-instruct",
torch_dtype=torch.bfloat16,
device=0
@@ -73,10 +73,10 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
```
</hfoption>
<hfoption id="transformers-cli">
<hfoption id="transformers CLI">
```bash
transformers-cli chat --model_name_or_path tiiuae/falcon-mamba-7b-instruct --torch_dtype auto --device 0
transformers chat tiiuae/falcon-mamba-7b-instruct --torch_dtype auto --device 0
```
</hfoption>

View File

@@ -80,10 +80,10 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
</hfoption>
<hfoption id="transformers-cli">
<hfoption id="transformers CLI">
```bash
echo -e "LLMs generate text through a process known as" | transformers-cli run --task text-generation --model google/gemma-2b --device 0
echo -e "LLMs generate text through a process known as" | transformers run --task text-generation --model google/gemma-2b --device 0
```
</hfoption>
@@ -114,8 +114,8 @@ model = AutoModelForCausalLM.from_pretrained(
input_text = "LLMs generate text through a process known as."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(
**input_ids,
max_new_tokens=50,
**input_ids,
max_new_tokens=50,
cache_implementation="static"
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
@@ -127,7 +127,7 @@ Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/bl
from transformers.utils.attention_visualizer import AttentionMaskVisualizer
visualizer = AttentionMaskVisualizer("google/gemma-2b")
visualizer("LLMs generate text through a process known as")
visualizer("LLMs generate text through a process known as")
```
<div class="flex justify-center">

View File

@@ -58,7 +58,7 @@ pipe("Explain quantum computing simply. ", max_new_tokens=50)
</hfoption>
<hfoption id="AutoModel">
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -80,16 +80,16 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
</hfoption>
<hfoption id="transformers-cli">
<hfoption id="transformers CLI">
```
echo -e "Explain quantum computing simply." | transformers-cli run --task text-generation --model google/gemma-2-2b --device 0
echo -e "Explain quantum computing simply." | transformers run --task text-generation --model google/gemma-2-2b --device 0
```
</hfoption>
</hfoptions>
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
The example below uses [bitsandbytes](../quantization/bitsandbytes) to only quantize the weights to int4.
```python
@@ -118,7 +118,7 @@ Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/bl
```python
from transformers.utils.attention_visualizer import AttentionMaskVisualizer
visualizer = AttentionMaskVisualizer("google/gemma-2b")
visualizer("You are an assistant. Make sure you print me")
visualizer("You are an assistant. Make sure you print me")
```
<div class="flex justify-center">
@@ -137,7 +137,7 @@ visualizer("You are an assistant. Make sure you print me")
inputs = tokenizer(text="My name is Gemma", return_tensors="pt")
max_generated_length = inputs.input_ids.shape[1] + 10
past_key_values = HybridCache(config=model.config, max_batch_size=1,
past_key_values = HybridCache(config=model.config, max_batch_size=1,
max_cache_len=max_generated_length, device=model.device, dtype=model.dtype)
outputs = model(**inputs, past_key_values=past_key_values, use_cache=True)
```

View File

@@ -99,10 +99,10 @@ print(processor.decode(output[0], skip_special_tokens=True))
```
</hfoption>
<hfoption id="transformers-cli">
<hfoption id="transformers CLI">
```bash
echo -e "Plants create energy through a process known as" | transformers-cli run --task text-generation --model google/gemma-3-1b-pt --device 0
echo -e "Plants create energy through a process known as" | transformers run --task text-generation --model google/gemma-3-1b-pt --device 0
```
</hfoption>

View File

@@ -64,10 +64,10 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
```
</hfoption>
<hfoption id="transformers-cli">
<hfoption id="transformers CLI">
```bash
echo -e "Hello, I'm a language model" | transformers-cli run --task text-generation --model openai-community/gpt2 --device 0
echo -e "Hello, I'm a language model" | transformers run --task text-generation --model openai-community/gpt2 --device 0
```
</hfoption>
@@ -82,16 +82,16 @@ import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype="float16",
bnb_4bit_use_double_quant=True
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype="float16",
bnb_4bit_use_double_quant=True
)
model = AutoModelForCausalLM.from_pretrained(
"openai-community/gpt2-xl",
quantization_config=quantization_config,
device_map="auto"
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2-xl")

View File

@@ -75,10 +75,10 @@ output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
</hfoption>
<hfoption id="transformers-cli">
<hfoption id="transformers CLI">
```bash
echo -e "Plants create energy through a process known as" | transformers-cli run --task text-generation --model ai21labs/AI21-Jamba-Mini-1.6 --device 0
echo -e "Plants create energy through a process known as" | transformers run --task text-generation --model ai21labs/AI21-Jamba-Mini-1.6 --device 0
```
</hfoption>

View File

@@ -74,10 +74,10 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
```
</hfoption>
<hfoption id="transformers-cli">
<hfoption id="transformers CLI">
```bash
echo -e "Plants create energy through a process known as" | transformers-cli run --task text-generation --model huggyllama/llama-7b --device 0
echo -e "Plants create energy through a process known as" | transformers run --task text-generation --model huggyllama/llama-7b --device 0
```
</hfoption>

View File

@@ -74,10 +74,10 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
```
</hfoption>
<hfoption id="transformers-cli">
<hfoption id="transformers CLI">
```bash
transformers-cli chat --model_name_or_path meta-llama/Llama-2-7b-chat-hf --torch_dtype auto --attn_implementation flash_attention_2
transformers chat meta-llama/Llama-2-7b-chat-hf --torch_dtype auto --attn_implementation flash_attention_2
```
</hfoption>
@@ -175,4 +175,3 @@ visualizer("Plants create energy through a process known as")
[[autodoc]] LlamaForSequenceClassification
- forward

View File

@@ -76,10 +76,10 @@ tokenizer.decode(predictions).split()
```
</hfoption>
<hfoption id="transformers-cli">
<hfoption id="transformers CLI">
```bash
echo -e "San Francisco 49ers cornerback Shawntae Spencer will miss the rest of the <mask> with a torn ligament in his left knee." | transformers-cli run --task fill-mask --model allenai/longformer-base-4096 --device 0
echo -e "San Francisco 49ers cornerback Shawntae Spencer will miss the rest of the <mask> with a torn ligament in his left knee." | transformers run --task fill-mask --model allenai/longformer-base-4096 --device 0
```
</hfoption>
@@ -147,42 +147,42 @@ echo -e "San Francisco 49ers cornerback Shawntae Spencer will miss the rest of t
## LongformerForMaskedLM
[[autodoc]] LongformerForMaskedLM
[[autodoc]] LongformerForMaskedLM
- forward
## LongformerForSequenceClassification
[[autodoc]] LongformerForSequenceClassification
[[autodoc]] LongformerForSequenceClassification
- forward
## LongformerForMultipleChoice
[[autodoc]] LongformerForMultipleChoice
[[autodoc]] LongformerForMultipleChoice
- forward
## LongformerForTokenClassification
[[autodoc]] LongformerForTokenClassification
[[autodoc]] LongformerForTokenClassification
- forward
## LongformerForQuestionAnswering
[[autodoc]] LongformerForQuestionAnswering
[[autodoc]] LongformerForQuestionAnswering
- forward
## TFLongformerModel
[[autodoc]] TFLongformerModel
[[autodoc]] TFLongformerModel
- call
## TFLongformerForMaskedLM
[[autodoc]] TFLongformerForMaskedLM
[[autodoc]] TFLongformerForMaskedLM
- call
## TFLongformerForQuestionAnswering
[[autodoc]] TFLongformerForQuestionAnswering
[[autodoc]] TFLongformerForQuestionAnswering
- call
## TFLongformerForSequenceClassification
@@ -192,10 +192,10 @@ echo -e "San Francisco 49ers cornerback Shawntae Spencer will miss the rest of t
## TFLongformerForTokenClassification
[[autodoc]] TFLongformerForTokenClassification
[[autodoc]] TFLongformerForTokenClassification
- call
## TFLongformerForMultipleChoice
[[autodoc]] TFLongformerForMultipleChoice
[[autodoc]] TFLongformerForMultipleChoice
- call

View File

@@ -27,7 +27,7 @@ rendered properly in your Markdown viewer.
# Mistral
[Mistral](https://huggingface.co/papers/2310.06825) is a 7B parameter language model, available as a pretrained and instruction-tuned variant, focused on balancing
[Mistral](https://huggingface.co/papers/2310.06825) is a 7B parameter language model, available as a pretrained and instruction-tuned variant, focused on balancing
the scaling costs of large models with performance and efficient inference. This model uses sliding window attention (SWA) trained with a 8K context length and a fixed cache size to handle longer sequences more effectively. Grouped-query attention (GQA) speeds up inference and reduces memory requirements. Mistral also features a byte-fallback BPE tokenizer to improve token handling and efficiency by ensuring characters are never mapped to out-of-vocabulary tokens.
You can find all the original Mistral checkpoints under the [Mistral AI_](https://huggingface.co/mistralai) organization.
@@ -78,10 +78,10 @@ The example below demonstrates how to chat with [`Pipeline`] or the [`AutoModel`
```
</hfoption>
<hfoption id="transformers-cli">
<hfoption id="transformers CLI">
```python
echo -e "My favorite condiment is" | transformers-cli chat --model_name_or_path mistralai/Mistral-7B-v0.3 --torch_dtype auto --device 0 --attn_implementation flash_attention_2
echo -e "My favorite condiment is" | transformers chat mistralai/Mistral-7B-v0.3 --torch_dtype auto --device 0 --attn_implementation flash_attention_2
```
</hfoption>

View File

@@ -76,10 +76,10 @@ print(f"The predicted token is: {predicted_token}")
```
</hfoption>
<hfoption id="transformers-cli">
<hfoption id="transformers CLI">
```bash
echo -e "The capital of France is [MASK]." | transformers-cli run --task fill-mask --model google/mobilebert-uncased --device 0
echo -e "The capital of France is [MASK]." | transformers run --task fill-mask --model google/mobilebert-uncased --device 0
```
</hfoption>

View File

@@ -79,10 +79,10 @@ print(f"The predicted token is: {predicted_token}")
```
</hfoption>
<hfoption id="transformers-cli">
<hfoption id="transformers CLI">
```bash
echo -e "Plants create [MASK] through a process known as photosynthesis." | transformers-cli run --task fill-mask --model answerdotai/ModernBERT-base --device 0
echo -e "Plants create [MASK] through a process known as photosynthesis." | transformers run --task fill-mask --model answerdotai/ModernBERT-base --device 0
```
</hfoption>

View File

@@ -70,10 +70,10 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
</hfoption>
<hfoption id="transformers-cli">
<hfoption id="transformers CLI">
```bash
echo -e "The future of AI is" | transformers-cli run --task text-generation --model openai-community/openai-gpt --device 0
echo -e "The future of AI is" | transformers run --task text-generation --model openai-community/openai-gpt --device 0
```
</hfoption>

View File

@@ -65,10 +65,10 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
```
</hfoption>
<hfoption id="transformers-cli">
<hfoption id="transformers CLI">
```bash
echo -e "'''def print_prime(n): """ Print all primes between 1 and n"""'''" | transformers-cli run --task text-classification --model microsoft/phi-1.5 --device 0
echo -e "'''def print_prime(n): """ Print all primes between 1 and n"""'''" | transformers run --task text-classification --model microsoft/phi-1.5 --device 0
```
</hfoption>
@@ -102,7 +102,7 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1")
model = AutoModelForCausalLM.from_pretrained(
"microsoft/phi-1",
@@ -110,12 +110,12 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
device_map="auto",
trust_remote_code=True,
attn_implementation="sdpa")
input_ids = tokenizer('''def print_prime(n):
"""
Print all primes between 1 and n
"""''', return_tensors="pt").to("cuda")
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

View File

@@ -64,7 +64,7 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen2-1.5B-Instruct",
torch_dtype=torch.bfloat16,
torch_dtype=torch.bfloat16,
device_map="auto",
attn_implementation="sdpa"
)
@@ -86,10 +86,10 @@ generated_ids = model.generate(
model_inputs.input_ids,
cache_implementation="static",
max_new_tokens=512,
do_sample=True,
temperature=0.7,
top_k=50,
top_p=0.95
do_sample=True,
temperature=0.7,
top_k=50,
top_p=0.95
)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
@@ -100,11 +100,11 @@ print(response)
```
</hfoption>
<hfoption id="transformers-cli">
<hfoption id="transformers CLI">
```bash
# pip install -U flash-attn --no-build-isolation
transformers-cli chat --model_name_or_path Qwen/Qwen2-7B-Instruct --torch_dtype auto --attn_implementation flash_attention_2 --device 0
transformers chat Qwen/Qwen2-7B-Instruct --torch_dtype auto --attn_implementation flash_attention_2 --device 0
```
</hfoption>
@@ -121,21 +121,21 @@ from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen2-7B",
torch_dtype=torch.bfloat16,
device_map="auto",
quantization_config=quantization_config,
attn_implementation="flash_attention_2"
attn_implementation="flash_attention_2"
)
inputs = tokenizer("The Qwen2 model family is", return_tensors="pt").to("cuda")
inputs = tokenizer("The Qwen2 model family is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

View File

@@ -75,10 +75,10 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
```
</hfoption>
<hfoption id="transformers-cli">
<hfoption id="transformers CLI">
```bash
echo -e "translate English to French: The weather is nice today." | transformers-cli run --task text2text-generation --model google-t5/t5-base --device 0
echo -e "translate English to French: The weather is nice today." | transformers run --task text2text-generation --model google-t5/t5-base --device 0
```
</hfoption>