Mastering Data Science With Microsoft Fabric: A Tutorial For Beginners

Introduction:

Microsoft Fabric is a cloud-based platform that offers a unified data science, data engineering, and business intelligence experience. It provides features and services such as data preparation, machine learning, and visualization. This toolset lets data professionals and business users alike get more value from their data.

Fabric’s core services include Data Factory, Synapse Data Engineering, Synapse Data Science, Synapse Data Warehousing, Synapse Real-Time Analytics, and Power BI. Together they cover data science needs ranging from data integration and engineering to real-time analytics and visualization.

In this blog, we focus on Fabric’s data science services. We show how to use Microsoft Fabric to build a diabetes prediction model and explore the tools available in the notebook.

To access Microsoft Fabric, create an account at app.fabric.microsoft.com for a free trial. If you are an existing Power BI customer, you can sign in with your Power BI account credentials.

Check out our blog “Mastering Data Science with Microsoft Fabric: Introduction to Fabric Notebook Features” to learn about notebook capabilities that can improve your data exploration and experimentation process.

Fabric Lakehouse and Notebooks:

For our diabetes prediction model, we will use the “pima-indians-diabetes” dataset from Kaggle, which contains records for 768 patients, each labeled with whether or not the patient has diabetes.

Data can be structured or unstructured, and both kinds need to be stored somewhere. Fabric’s Lakehouse is one of the items that can store data, and it is a data architecture platform for managing and analyzing that data. It scales to handle large volumes of data and supports a variety of data processing tools and frameworks. To learn more about the Lakehouse, refer to “What is a lakehouse in Microsoft Fabric?”

Fabric uses the notebook artifact within the Data Science experience to demonstrate the framework’s capabilities. Notebooks can be used to develop machine learning experiments and to support their deployment. The Data Science service and the notebook provide a wide range of features, which are discussed further below. To learn more about notebooks, refer to “How to use Microsoft Fabric notebooks”.

Follow the steps below to store files/data in a Lakehouse:

1. Go to the Microsoft Fabric home and select Data Engineering from the menu.

2. Create a new Lakehouse.

3. Upload files from your local device. The uploaded files appear in the Lakehouse’s existing “Files” folder.

Now let’s see how we can train our model for Diabetes prediction.

4. You can either create a new notebook or import an existing one from the Data Engineering home page (shown in the image in step 2) or from the Data Science home page, as shown in the image below.

5. Connect a Lakehouse to your notebook. You can either create a new Lakehouse or connect an existing one.

6. Follow the notebook code below to train the machine learning model for diabetes prediction.
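
The training code in the next section relies on variables named X, y, X_train, X_test, y_train, and y_test, so the data has to be loaded and split first. Below is a minimal sketch of that preparation, not the original notebook's code. It assumes the label column is called Outcome (as in the Kaggle version of the dataset). The helper name split_diabetes_data, the 80/20 split, the tiny made-up demo frame, and the commented-out Lakehouse path and file name are all illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_diabetes_data(df, label_col="Outcome", test_size=0.2, seed=12345):
    """Separate features from the label and create a stratified train/test split."""
    X = df.drop(columns=[label_col])
    y = df[label_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    return X, y, X_train, X_test, y_train, y_test


# In a Fabric notebook with a default Lakehouse attached, the uploaded CSV could
# be read along these lines (path and file name are assumptions):
# df = pd.read_csv("/lakehouse/default/Files/diabetes.csv")

# Tiny synthetic frame standing in for the real CSV (values are made up).
demo = pd.DataFrame({
    "Glucose": [85, 168, 99, 183, 110, 150, 92, 170, 101, 160],
    "BMI": [26.6, 38.0, 25.1, 23.3, 30.2, 35.4, 22.0, 40.1, 27.5, 33.3],
    "Outcome": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

X, y, X_train, X_test, y_train, y_test = split_diabetes_data(demo)
print(X_train.shape, X_test.shape)
```

Stratifying on the label keeps the share of positive and negative cases similar in both parts, which matters for a dataset where the classes are not balanced.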

Machine Learning Model Training and Prediction Scoring

This section walks through the steps involved in training a scikit-learn model, including saving the trained model. It then shows how to use the saved model for predictions once training is complete. To learn more about models in Fabric, refer to “How to train models with scikit-learn in Microsoft Fabric”.

Please note that the code in this section is written for Microsoft Fabric notebooks. Running it on other platforms, such as Colab, may result in errors. This is because the PREDICT function used in the code (MLFlowTransformer from synapse.ml.predict) is provided by the Fabric Spark runtime and requires models to be saved in the MLflow format.

1. A machine learning experiment is the basic unit for organizing and managing all related machine learning runs. To create an experiment for the model, run the code below.

				
import mlflow
mlflow.set_experiment("Diabetes-Prediction")

This creates a new experiment named “Diabetes-Prediction” in your workspace if one does not already exist, and sets it as the active experiment. See “Machine learning experiments in Microsoft Fabric” to learn more about experiments.

Alternatively, you can create an experiment from the UI: in your workspace, select Experiment from the dropdown.

2. The following code shows how to use the MLflow API to set the experiment and launch an MLflow run for an LGBMClassifier model. LGBMClassifier comes from the LightGBM library and follows the scikit-learn API, so it can be logged with mlflow.sklearn. After training, the model is saved and registered as a new version in the Microsoft Fabric workspace.

In the code below, replace "diabetes-model" in mlflow.sklearn.log_model() with your own model name.

				
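import mlflow
import mlflow.sklearn
from lightgbm import LGBMClassifier
from mlflow.models.signature import infer_signature
from sklearn.metrics import accuracy_score

# X, y and the X_train/X_test/y_train/y_test split are assumed to have been
# prepared earlier in the notebook.
mlflow.set_experiment("Diabetes-Prediction")
with mlflow.start_run() as run:
    model = LGBMClassifier(random_state=12345)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)  # accuracy on the held-out test data
    score = model.score(X_train, y_train)      # accuracy on the training data
    signature = infer_signature(X, y)          # input/output schema stored with the model

    print('score...:', score)
    print('Accuracy...:', accuracy)
    # Save the model in MLflow format and register it under the given name.
    mlflow.sklearn.log_model(
        model,
        "diabetes-model",
        signature=signature,
        registered_model_name="diabetes-model"
    )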
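
The two printed values measure different things: score is the accuracy on the training data, while Accuracy is the accuracy on the held-out test data. The sketch below illustrates the difference on synthetic data. It is not the tutorial's model: a scikit-learn DecisionTreeClassifier stands in for LGBMClassifier so that the example does not require LightGBM, and all variable names and the generated data are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the tutorial uses the diabetes features instead.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=12345)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=12345
)

# Stand-in classifier; LGBMClassifier exposes the same fit/predict/score methods.
clf = DecisionTreeClassifier(random_state=12345).fit(X_tr, y_tr)

train_accuracy = clf.score(X_tr, y_tr)  # accuracy on data the model was fitted on
test_accuracy = clf.score(X_te, y_te)   # accuracy on data the model has not seen
manual_test_accuracy = accuracy_score(y_te, clf.predict(X_te))

print("train accuracy:", train_accuracy)
print("test accuracy:", test_accuracy)
```

For a classifier, score() and accuracy_score() compute the same metric, so the test-set figure is usually the more honest estimate of how the model will behave on new patients.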

3. Once the model has been saved, it can be loaded for inference. Here we load the registered model and run inference on a sample dataset. Use the code below to make predictions on your test data.

				
from pyspark.sql import SparkSession
from synapse.ml.predict import MLFlowTransformer

spark = SparkSession.builder.getOrCreate()
# inferSchema makes Spark read numeric columns as numbers rather than strings.
test = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("Files/diabetes_test.csv")
)
# test is now a Spark DataFrame containing CSV data from "Files/diabetes_test.csv".
display(test)

# You can substitute values below for your own input columns,
# output column name, model name, and model version
model = MLFlowTransformer(
    inputCols=test.columns,
    outputCol='predictions',
    modelName='diabetes-model',
    modelVersion=1
)
# transform() returns a new DataFrame. show() only prints it and returns None,
# so keep the DataFrame in its own variable before displaying it.
prediction = model.transform(test)
prediction.show()
pred_df = prediction.toPandas()

Replace inputCols, modelName, and modelVersion with the feature columns of your test dataset, your model name, and your model version.
Alternatively, you can generate the PREDICT code above from the model’s item page in the UI and use it to run inference on your test data.
Open the model from the workspace where you saved it.

Select the model version from the sidebar, click the “Apply model” button, and select “Apply this model in the wizard”, as shown in the image below.

Follow the steps in the left sidebar, as outlined in the image below and described in “Generate PREDICT code from a model’s item page”, and enter the name of the notebook where you want to save the inference code.

You will see the generated code in that notebook.

Running the generated code adds the “predictions” column to your test DataFrame.
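
Once the predictions have been converted to a pandas DataFrame, a quick summary can show how many rows fall into each predicted class. The sketch below is one possible way to do that. The helper name summarize_predictions and the demo values are hypothetical, and the accuracy part only applies if your test file also contains the true labels in a column such as Outcome with the same data type as the predictions, which is an assumption.

```python
import pandas as pd


def summarize_predictions(pred_df, pred_col="predictions", label_col=None):
    """Count predicted classes and, if true labels are present, compute accuracy."""
    summary = {"counts": pred_df[pred_col].value_counts().to_dict()}
    if label_col is not None and label_col in pred_df.columns:
        matches = pred_df[pred_col] == pred_df[label_col]
        summary["accuracy"] = float(matches.mean())
    return summary


# Made-up frame standing in for pred_df from the scoring step.
demo_pred_df = pd.DataFrame({
    "Glucose": [85, 168, 99, 183],
    "predictions": [0, 1, 1, 1],
    "Outcome": [0, 1, 0, 1],
})

result = summarize_predictions(demo_pred_df, label_col="Outcome")
print(result)
```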

This way you can use Fabric Notebook for your data science experiments.
