Building Frugal OpenSource LLM Applications

using Serverless Cloud

Useful for PoCs and Batch Processing Jobs

Motivation

  • Want to build LLM applications?
  • Wondering what the most cost-effective way is to learn and build them in the cloud?

Think OpenSource LLM.
Think Serverless

Debates that we ARE NOT having today

Or probably could have by the end of the session:

OpenSource LLMs vs Paid LLMs
Own Cloud hosted LLM vs Serverless Pay-as-you-go LLM APIs

Note:

  • The above are 2 different debates.
  • You can pay to use the serverless AWS Bedrock API but still invoke an open-source LLM like Mistral 7B Instruct.

Purpose of this Presentation

Let us see how the intermingling of 2 concepts - Serverless + Open Source LLMs - helps you build demo-able PoC LLM applications at minimal cost.

#LLMOps
#MLOps
#AWSLambda
#LLMonServerless
#OpenSourceLLMs

LLM Recipes we are discussing today:

  • 1) A Lambda to run inference on a purpose-built Transformer ML Model
    • A Lambda to Anonymize Text using a Huggingface BERT Transformer-based Language Model for PII De-identification
  • 2) A Lambda to run a Small Language Model like Microsoft's Phi3
  • 3) A Lambda to run a RAG Implementation on a Small Language Model like Phi3
  • 4) A Lambda to invoke an LLM like Mistral 7B Instruct
    • the LLM runs in a SageMaker endpoint

1. Lambda to Anonymize Text

  • A Lambda to run inference on a purpose-built ML Model
    • This lambda can Anonymize Text
    • using a Huggingface BERT Transformer-based Fine-tuned Model

https://senthilkumarm1901.github.io/aws_serverless_recipes/container_lambda_anonymize_text/
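A minimal sketch of what such a handler could look like, using the ab-ai/pii_model checkpoint referenced later in this deck. The event shape (event["text"]) and the replacement format are illustrative assumptions, not the exact code behind the link above.

# Hypothetical Lambda handler: mask PII spans detected by the BERT-based token-classification model
import json
import os

os.environ['HF_HOME'] = '/tmp/model'   # cache models in Lambda's only writable dir

from transformers import pipeline

ner = pipeline("token-classification", model="ab-ai/pii_model", aggregation_strategy="simple")

def lambda_handler(event, context):
    text = event["text"]                                   # assumed input shape
    entities = sorted(ner(text), key=lambda e: e["start"], reverse=True)
    for ent in entities:                                   # replace each detected span with its label
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return {"statusCode": 200, "body": json.dumps({"anonymized_text": text})}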

2. Small Language Model

  • A Lambda to run a Small Language Model like Microsoft's Phi3

https://senthilkumarm1901.github.io/aws_serverless_recipes/container_lambda_to_run_slm/
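A rough sketch of the idea, assuming the Phi3 model is packaged as a quantized GGUF file inside the container image and served with llama-cpp-python. The model path, prompt template, and generation parameters are assumptions; the linked recipe is the reference implementation.

# Hypothetical handler: run a quantized Phi-3 GGUF model with llama-cpp-python inside the Lambda container
import json
from llama_cpp import Llama

# assumed path: the quantized model is baked into the container image
llm = Llama(model_path="/opt/models/phi-3-mini-4k-instruct-q4.gguf", n_ctx=2048)

def lambda_handler(event, context):
    prompt = event["prompt"]                               # assumed input shape
    output = llm(f"<|user|>\n{prompt}<|end|>\n<|assistant|>\n", max_tokens=256, stop=["<|end|>"])
    return {"statusCode": 200, "body": json.dumps({"answer": output["choices"][0]["text"]})}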

3. Small Language Model with RAG

  • A Lambda to run a RAG implementation on a Small Language Model like Phi3, so that the model answers with better context

What is RAG, and how does it improve LLM accuracy?

Retrieval augmented generation, or RAG, is an architectural approach that can improve the efficacy of large language model (LLM) applications by leveraging custom data.

Source: Databricks

How does an LLM work?

Source: AnyScale Blog: a-comprehensive-guide-for-building-rag-based-llm-applications

How does RAG work in an LLM?

Source: RealPython Blog: chromadb-vector-database

How is a Vector DB created?

Source: AnyScale Blog: a-comprehensive-guide-for-building-rag-based-llm-applications

Detour: If you wish to use other Vector databases

Source: Data Quarry Blog: Vector databases - What makes each one different?

  • The URL we are testing on is from my favorite DL/NLP researcher:
    • https://magazine.sebastianraschka.com/p/understanding-large-language-models

https://senthilkumarm1901.github.io/aws_serverless_recipes/container_lambda_to_run_rag_slm/
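The gist of the recipe, sketched here with ChromaDB's in-memory client and llama-cpp-python. The chunking, the default embedding function, the model path, and the prompt format are simplified assumptions; see the linked recipe for the actual implementation.

# Hypothetical RAG flow: embed article chunks into ChromaDB, retrieve top matches, and let the SLM answer
import chromadb
from llama_cpp import Llama

llm = Llama(model_path="/opt/models/phi-3-mini-4k-instruct-q4.gguf", n_ctx=4096)

def answer_with_rag(question, chunks):
    client = chromadb.Client()                             # ephemeral, in-memory vector store
    collection = client.create_collection("article")
    # note: Chroma's default embedding model would also need to be cached under /tmp in Lambda
    collection.add(documents=chunks, ids=[str(i) for i in range(len(chunks))])
    hits = collection.query(query_texts=[question], n_results=3)
    context = "\n".join(hits["documents"][0])              # top-3 retrieved chunks
    prompt = f"<|user|>\nContext:\n{context}\n\nQuestion: {question}<|end|>\n<|assistant|>\n"
    return llm(prompt, max_tokens=256, stop=["<|end|>"])["choices"][0]["text"]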

4. Large Language Model (Partially Serverless)

  • A Lambda to invoke an LLM like Mistral 7B Instruct
    • that is running in a SageMaker endpoint

https://senthilkumarm1901.github.io/aws_serverless_recipes/lambda_to_invoke_a_sagemaker_endpoint/
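A minimal sketch of the Lambda side using boto3's sagemaker-runtime client. The endpoint name and the JSON payload shape (typical of a Hugging Face text-generation container) are assumptions that depend on how the Mistral endpoint was actually deployed.

# Hypothetical handler: forward the prompt to a SageMaker endpoint hosting Mistral 7B Instruct
import json
import boto3

runtime = boto3.client("sagemaker-runtime")
ENDPOINT_NAME = "mistral-7b-instruct-endpoint"             # assumed endpoint name

def lambda_handler(event, context):
    payload = {"inputs": event["prompt"],                  # payload shape typical of HF text-generation containers
               "parameters": {"max_new_tokens": 256, "temperature": 0.7}}
    response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                       ContentType="application/json",
                                       Body=json.dumps(payload))
    result = json.loads(response["Body"].read())
    return {"statusCode": 200, "body": json.dumps(result)}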

Exploring Some of the Answers from the LLMs

Key Challenges Faced

  • Serverless often means low-end CPU architectures; hence, latency is high for RAG LLM implementations
  • RAG can involve a large context, but converting that context into a vector store takes time; hence the context size needs to stay small for AWS Lambda implementations
  • The maximum timeout for Lambda is 15 minutes, while API Gateway times out in about 30 seconds; hence API Gateway could not be used for the RAG LLM implementation

What knowledge do you gain from this way of practicing?

MLOps Concepts:

  • Dockerizing ML applications. What works on your machine works everywhere. More than 70% of the time spent building these LLM apps goes into perfecting the Dockerfile.
  • The art of storing ML models in AWS Lambda containers. Use cache_dir well; otherwise, models get downloaded every time the Docker container is created:
import os

# `/tmp` is the only writable directory in AWS Lambda; point the Hugging Face cache there
os.environ['HF_HOME'] = '/tmp/model'

from transformers import AutoTokenizer, AutoModelForTokenClassification

your_model = "ab-ai/pii_model"
tokenizer = AutoTokenizer.from_pretrained(your_model, cache_dir='/tmp/model')
ner_model = AutoModelForTokenClassification.from_pretrained(your_model, cache_dir='/tmp/model')

AWS Concepts:

  • The aws cli is your friend for shortening deployments, especially for Serverless
  • API Gateway is a frustratingly beautiful service, but a combination of the aws cli and an OpenAPI spec makes it replicable
  • AWS Lambda costing is awesomely cheap for PoCs:
## AWS Lambda ARM Architecture Costs (assuming you have used up all your free tier)
Number of requests: 50 per day * (730 hours in a month / 24 hours in a day) = 1,520.83 per month
Average duration assumed: 120,000 ms (2 minutes) per request
Amount of memory allocated: 10240 MB x 0.0009765625 GB in a MB = 10 GB
Amount of ephemeral storage allocated: 5120 MB x 0.0009765625 GB in a MB = 5 GB

Pricing calculations
1,520.83 requests x 120,000 ms x 0.001 ms to sec conversion factor = 182,499.60 total compute (seconds)
10 GB x 182,499.60 seconds = 1,824,996.00 total compute (GB-s)
1,824,996.00 GB-s x 0.0000133334 USD = 24.33 USD (monthly compute charges)
1,520.83 requests x 0.0000002 USD = 0.00 USD (monthly request charges)
5 GB - 0.5 GB (no additional charge) = 4.50 GB billable ephemeral storage per function
4.50 GB x 182,499.60 seconds = 821,248.20 total storage (GB-s)
821,248.20 GB-s x 0.0000000352 USD = 0.0289 USD (monthly ephemeral storage charges)
24.33 USD + 0.0289 USD = 24.36 USD

Lambda costs - Without Free Tier (monthly): 24.36 USD
  • If I run a c5.large (minimal CPU) EC2 instance throughout the month, the cost is about 60 USD
  • If I run a g4dn.xlarge (minimal GPU) EC2 instance throughout the month, the cost is about 420 USD
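To sanity-check the arithmetic above, here is the same calculation as a small Python script; the inputs mirror the slide (about 1,520.83 requests/month, 120 s average duration, 10 GB memory, 5 GB ephemeral storage) and the unit prices are the ARM Lambda rates used above.

# Reproduce the Lambda ARM cost estimate from the slide
requests_per_month = 50 * (730 / 24)          # ~1,520.83 requests
duration_s = 120                              # 120,000 ms per request
memory_gb = 10240 / 1024                      # 10 GB
ephemeral_gb = 5120 / 1024                    # 5 GB

compute_gb_s = memory_gb * requests_per_month * duration_s
compute_cost = compute_gb_s * 0.0000133334            # USD per GB-s (ARM)
request_cost = requests_per_month * 0.0000002         # USD per request
storage_gb_s = max(ephemeral_gb - 0.5, 0) * requests_per_month * duration_s
storage_cost = storage_gb_s * 0.0000000352            # USD per GB-s of extra ephemeral storage

print(round(compute_cost + request_cost + storage_cost, 2))   # ~24.36 USD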

Finally, the LLM Concepts:

  • Frameworks: Llama cpp, LangChain, LlamaIndex, Huggingface (and so many more!)
  • SLMs work well for reasoning but are too slow or inaccurate for general-knowledge questions

Models are like wines and these LLM frameworks are like bottles. What matters is the wine more than the bottle, but getting used to how the wines are stored in the bottles helps.

Next Steps for the author:

  • The code here may not be fully efficient! We can further reduce cost if runtime is reduced


For Phi3-Mini-RAG:

  • Try leveraging a better embedding model (apart from the ancient Sentence Transformers)
  • What about other vector databases, like Pinecone or Milvus? (We have used open-source ChromaDB here)
  • Idea to explore: Rust for LLMs. Rust for Lambda.

Sources:

  • Rust ML Minimalist framework - Candle: https://github.com/huggingface/candle
  • Rust for LLM - https://github.com/rustformers/llm
  • Rust for AWS Lambda - https://www.youtube.com/watch?v=He4inXmMZZI

Next Steps for the reader:

  • Replicate the instructions in the given GitHub links
    • Familiarize yourself with Dockerizing ML applications
    • Provision AWS resources like AWS Lambda and API Gateway using tools like the aws cli and OpenAPI
  • Explore various other avenues for using LLMs (especially the paid ones). Paid APIs are a cakewalk compared to this, but they won't give you the same depth of implementation

github.com/senthilkumarm1901/serverless_nlp_app

Thank You