Rice Variety Prediction: RF, MLP & Ensemble Methods

by Admin
📈 Rice Variety Prediction: Random Forest, MLP & Ensemble

Hey guys! This app helps you train Random Forest (RF) and Multi-Layer Perceptron (MLP) classifiers, combine them through weighted soft voting (the "hybrid") and through a stacking classifier, and predict rice varieties from soil and environmental characteristics. Let's dive in and make some predictions!

How to Use This App:

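  1. Prepare Your CSV File: Make sure your CSV file includes the following columns. The app renames them to its internal column names (for example, temperature becomes suhu), and a file that already uses the internal names works too. The match is exact and case-sensitive, so any other spelling is reported as a missing column: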
    • N, P, K: Nitrogen, phosphorus, and potassium content (mg/kg)
    • temperature: Temperature (°C)
    • humidity: Humidity (%)
    • ph: Soil pH
    • rainfall: Rainfall (mm)
    • variety: The target variable indicating the rice variety
  2. Upload Your File: Use the file uploader widget below to upload your CSV file.
  3. Train the Models: Click the "Train Models" button to start the training and evaluation process.
  4. Adjust Soft Voting Weights: Move the soft voting weight slider to see how the balance between RF and MLP affects the hybrid model's performance.
  5. Get Recommendations: Choose a row from the test data or enter your own values to get the top-3 rice variety recommendations with simple, easy-to-understand explanations.
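The column requirements in step 1 can be checked before uploading a file. The sketch below is only an illustration: `ALIASES` and `build_rename_map` are hypothetical names that do not appear in the app, and the case-insensitive, whitespace-tolerant matching is an assumption that goes beyond the exact-name renaming in the app's code.

```python
# Hypothetical helper: map user-facing column names to the internal names
# used by the app's code (n_mg_kg, suhu, ...). Matching ignores case and
# surrounding whitespace, which is an assumption of this sketch.
ALIASES = {
    "n": "n_mg_kg",
    "p": "p_mg_kg",
    "k": "k_mg_kg",
    "temperature": "suhu",
    "humidity": "kelembaban",
    "ph": "ph_tanah",
    "rainfall": "curah_hujan_mm",
    "variety": "varietas_padi",
}


def build_rename_map(columns):
    """Return (rename_map, missing) for a list of CSV column names."""
    internal_names = set(ALIASES.values())
    rename_map = {}
    found = set()
    for col in columns:
        key = col.strip().lower()
        if key in ALIASES:
            rename_map[col] = ALIASES[key]
            found.add(ALIASES[key])
        elif key in internal_names:
            found.add(key)
            if col != key:
                rename_map[col] = key  # normalize case/whitespace only
    missing = sorted(internal_names - found)
    return rename_map, missing


rename_map, missing = build_rename_map(
    ["N", "P", "K", "Temperature", "humidity ", "pH", "rainfall"]
)
print(rename_map)
print(missing)
```

With pandas, the resulting dictionary could be passed to `df.rename(columns=rename_map)`, and a non-empty `missing` list would signal that the file is not ready for training.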

Utility Functions

Preprocessing

Preprocessing comes first: it puts the data into a consistent numeric format, which the models (the MLP in particular) need in order to perform well.

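import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, OneHotEncoder


def to_float64(X):
    """Cast an array-like to a float64 NumPy array."""
    return np.asarray(X, dtype=np.float64)


def get_preprocessor(num_cols, cat_cols):
    """Create the preprocessing steps for numeric and categorical columns."""
    numeric_transformer = Pipeline(
        steps=[
            ('impute', SimpleImputer(strategy='median')),
            ('scale', MinMaxScaler()),
        ]
    )
    # OneHotEncoder ignores categories it did not see during fit
    # (the sparse_output argument requires scikit-learn 1.2 or newer)
    ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False, dtype=np.float64)
    categorical_transformer = Pipeline(
        steps=[
            ('impute', SimpleImputer(strategy='most_frequent')),
            ('onehot', ohe),
        ]
    )
    ct = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, num_cols),
            ('cat', categorical_transformer, cat_cols),
        ]
    )
    # Force dtype float64 after the ColumnTransformer. A named module-level
    # function is used instead of a lambda so the fitted pipelines can be pickled.
    tofloat = FunctionTransformer(to_float64, validate=False)
    return ct, tofloat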

This function builds the preprocessing steps for numerical and categorical columns. For numerical columns, it imputes missing values with the median and then rescales each feature to the [0, 1] range using MinMaxScaler. For categorical columns, it imputes missing values with the most frequent value and then applies one-hot encoding using OneHotEncoder, ignoring categories that were not seen during training. The ColumnTransformer applies each branch to its own list of columns. The function returns the ColumnTransformer together with a FunctionTransformer that casts the output to float64, so the two can be placed as consecutive steps in a model pipeline.
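To make the numeric branch concrete, here is a small NumPy sketch of what median imputation followed by min-max scaling computes. The toy matrix is made up for illustration; in the real pipeline the medians, minimums, and maximums are learned from the training data during fit and then reused on new samples.

```python
import numpy as np

# Toy feature matrix: rows are samples, columns are (N, temperature).
# np.nan marks a missing measurement.
X = np.array([
    [90.0, 20.0],
    [np.nan, 25.0],
    [60.0, 30.0],
    [120.0, np.nan],
])

# Step 1: median imputation, column by column (missing values are ignored
# when the median is computed).
medians = np.nanmedian(X, axis=0)
X_imputed = np.where(np.isnan(X), medians, X)

# Step 2: min-max scaling of each column to the range [0, 1].
col_min = X_imputed.min(axis=0)
col_max = X_imputed.max(axis=0)
X_scaled = (X_imputed - col_min) / (col_max - col_min)

print(medians)
print(X_scaled)
```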

Model Training

Now that the preprocessing pipeline is configured, the training phase can begin using the following function.

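from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier


def train_models(X_train, y_train, num_cols, cat_cols):
    """Train the RF and MLP pipelines and the stacking classifier."""
    # Build preprocessors
    ct, tofloat = get_preprocessor(num_cols, cat_cols)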
    # Random Forest pipeline
    rf_clf = RandomForestClassifier(
        n_estimators=100, random_state=42, n_jobs=-1
    )
    rf_pipe = ImbPipeline(
        steps=[
            ('preprocess', ct),
            ('tofloat', tofloat),
            ('smote', SMOTE(random_state=42)),
            ('clf', rf_clf),
        ]
    )
    rf_pipe.fit(X_train, y_train)
    # MLP pipeline
    mlp_clf = MLPClassifier(
        hidden_layer_sizes=(64, 32, 16),
        activation='relu',
        solver='adam',
        learning_rate_init=0.003,
        alpha=1e-4,
        batch_size='auto',
        max_iter=300,
        early_stopping=False,
        random_state=42,
    )
    mlp_pipe = ImbPipeline(
        steps=[
            ('preprocess', ct),
            ('tofloat', tofloat),
            ('smote', SMOTE(random_state=42)),
            ('clf', mlp_clf),
        ]
    )
    mlp_pipe.fit(X_train, y_train)
    # Stacking classifier
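    # multi_class is omitted: it is deprecated in recent scikit-learn releases,
    # and lbfgs already fits a multinomial model for multiclass targets.
    meta_learner = LogisticRegression(
        max_iter=1000,
        solver='lbfgs',
        C=0.1,
        class_weight='balanced',
        random_state=42,
    )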
    stack_clf = StackingClassifier(
        estimators=[('rf', rf_pipe), ('mlp', mlp_pipe)],
        final_estimator=meta_learner,
        stack_method='predict_proba',
        cv=5,
        n_jobs=-1,
    )
    stack_clf.fit(X_train, y_train)
    return rf_pipe, mlp_pipe, stack_clf

This function trains three models: a Random Forest, a Multi-Layer Perceptron (MLP), and a Stacking Classifier. The Random Forest and the MLP are each wrapped in an imbalanced-learn pipeline that runs preprocessing, SMOTE oversampling, and then the classifier; SMOTE is applied only while fitting, never at prediction time. The Stacking Classifier refits clones of both pipelines with 5-fold cross-validation and feeds their predicted class probabilities to a Logistic Regression meta-learner. All three are trained on the provided training data (X_train, y_train), and the function returns the two fitted pipelines and the stacking classifier.
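The idea behind SMOTE can be sketched in a few lines of NumPy: a synthetic sample is placed at a random point on the segment between a minority-class sample and one of its nearest minority-class neighbours. This is a simplified illustration, not imbalanced-learn's implementation; `smote_like_samples` and its parameters are hypothetical names chosen for this sketch.

```python
import numpy as np


def smote_like_samples(X_minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))
        neighbour = neighbours[base, rng.integers(k)]
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic[i] = X_minority[base] + lam * (X_minority[neighbour] - X_minority[base])
    return synthetic


# Four minority samples at the corners of the unit square.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_samples(X_minority, n_new=5)
print(new_points.shape)
```

One practical consequence for the app: imbalanced-learn's SMOTE uses k_neighbors=5 by default, so every rice variety needs at least six rows in the training split, otherwise fitting raises an error.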

Model Evaluation

To evaluate the performance of each model, the following function is used.

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score


def evaluate_models(rf_pipe, mlp_pipe, stack_clf, X_test, y_test):
    """Return predictions, accuracies, macro‑F1 and confusion matrices for each model."""
    rf_pred = rf_pipe.predict(X_test)
    mlp_pred = mlp_pipe.predict(X_test)
    stack_pred = stack_clf.predict(X_test)
    rf_acc = accuracy_score(y_test, rf_pred)
    mlp_acc = accuracy_score(y_test, mlp_pred)
    stack_acc = accuracy_score(y_test, stack_pred)
    rf_f1 = f1_score(y_test, rf_pred, average='macro')
    mlp_f1 = f1_score(y_test, mlp_pred, average='macro')
    stack_f1 = f1_score(y_test, stack_pred, average='macro')
    rf_cm = confusion_matrix(y_test, rf_pred)
    mlp_cm = confusion_matrix(y_test, mlp_pred)
    stack_cm = confusion_matrix(y_test, stack_pred)
    return {
        'rf': {'pred': rf_pred, 'acc': rf_acc, 'f1': rf_f1, 'cm': rf_cm},
        'mlp': {'pred': mlp_pred, 'acc': mlp_acc, 'f1': mlp_f1, 'cm': mlp_cm},
        'stack': {'pred': stack_pred, 'acc': stack_acc, 'f1': stack_f1, 'cm': stack_cm},
    }

This function takes the trained Random Forest, MLP, and Stacking Classifier models along with the test data, and returns predictions, accuracy, macro-F1, and a confusion matrix for each model, all computed by comparing the predictions with the true labels (y_test). Macro-F1 averages the per-class F1 scores with equal weight, so a rare variety counts as much as a common one, which makes it a useful complement to accuracy when the classes are imbalanced.
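As a worked example of how these metrics relate, accuracy and macro-F1 can both be derived from a confusion matrix by hand. The matrix below is made up for illustration, and `macro_f1_from_cm` is a hypothetical helper; the app itself relies on scikit-learn's accuracy_score and f1_score.

```python
import numpy as np


def macro_f1_from_cm(cm):
    """Macro-F1 from a confusion matrix (rows = true, columns = predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)  # column sums: how often each class was predicted
    actual = cm.sum(axis=1)     # row sums: how often each class truly occurs
    # Guard against division by zero for classes never predicted or absent.
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return f1.mean()


# Illustrative 3-class confusion matrix (30 test samples in total).
cm = np.array([
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
])
accuracy = np.trace(cm) / cm.sum()
macro_f1 = macro_f1_from_cm(cm)
print(round(float(accuracy), 3), round(float(macro_f1), 3))
```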

Cached Training and Evaluation

To optimize the application, training and evaluation are cached to avoid repeated computations.

import pandas as pd
import streamlit as st
from sklearn.model_selection import train_test_split


@st.cache_data(show_spinner=False)
def cached_train_and_evaluate(data: pd.DataFrame, test_size: float = 0.2):
    """Train and evaluate models. Caches result to avoid repeated training."""
    # Identify columns
    target_col = 'varietas_padi'
    num_cols = ['n_mg_kg', 'p_mg_kg', 'k_mg_kg', 'suhu', 'kelembaban', 'ph_tanah', 'curah_hujan_mm']
    cat_cols: list[str] = []
    X = data[num_cols + cat_cols].copy()
    y = data[target_col].copy()
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42, stratify=y
    )
    rf_pipe, mlp_pipe, stack_clf = train_models(X_train, y_train, num_cols, cat_cols)
    eval_result = evaluate_models(rf_pipe, mlp_pipe, stack_clf, X_test, y_test)
    # Compute probabilities for hybrid
    rf_proba = rf_pipe.predict_proba(X_test)
    mlp_proba = mlp_pipe.predict_proba(X_test)
    return {
        'rf_pipe': rf_pipe,
        'mlp_pipe': mlp_pipe,
        'stack_clf': stack_clf,
        'eval': eval_result,
        # Training features are kept so explanations can use training-set means
        'X_train': X_train,
        'X_test': X_test,
        'y_test': y_test,
        'rf_proba': rf_proba,
        'mlp_proba': mlp_proba,
    }

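This function trains and evaluates the models and caches the result. It splits the data into stratified training and test sets, trains the Random Forest, MLP, and Stacking Classifier models, evaluates them, and computes the per-class test probabilities that the hybrid model later blends. With @st.cache_data, the function body runs again only when the input DataFrame or test_size changes; otherwise Streamlit returns the stored result. Note that st.cache_data stores the return value by pickling it, so everything in the returned dictionary, including the fitted pipelines, must be picklable. Streamlit's documentation suggests st.cache_resource for objects such as ML models.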
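Since st.cache_data relies on pickle for return values, it can help to check whether an object survives pickling before caching it. The sketch below uses only the standard library; `to_float` and `is_picklable` are hypothetical names for this example. It shows why a lambda stored inside a cached pipeline (for instance inside a FunctionTransformer) would likely cause trouble, while a named module-level function is normally fine.

```python
import pickle


def to_float(values):
    """Module-level function: pickled by reference to its importable name."""
    return [float(v) for v in values]


def is_picklable(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


named_ok = is_picklable(to_float)
lambda_ok = is_picklable(lambda values: [float(v) for v in values])
print(named_ok, lambda_ok)
```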

Confusion Matrix Plotting

To visualize the performance of each model, confusion matrices are plotted.

import matplotlib.pyplot as plt
import seaborn as sns


def plot_confusion_matrix(cm: np.ndarray, class_names: list[str], title: str):
    """Plot a confusion matrix using seaborn and return a matplotlib figure."""
    fig, ax = plt.subplots(figsize=(4, 3))
    sns.heatmap(
        cm,
        annot=True,
        fmt='d',
        cmap='Blues',
        xticklabels=class_names,
        yticklabels=class_names,
        cbar_kws={'label': 'Count'},
        ax=ax,
    )
    ax.set_title(title)
    ax.set_ylabel('True label')
    ax.set_xlabel('Predicted label')
    plt.tight_layout()
    return fig

This function plots a confusion matrix using the seaborn library, providing a visual representation of the model's performance in terms of true and predicted labels. The function takes a confusion matrix cm, class names, and a title as input, and returns a matplotlib figure.

Recommendation Generation

One of the key features of this application is the ability to generate rice variety recommendations.

def recommend_top3(sample: pd.DataFrame, rf_pipe, mlp_pipe, rf_weight: float, X_train: pd.DataFrame):
    """Return top‑3 predicted classes and simple explanations for a single input sample."""
    # Compute probability from hybrid soft voting
    rf_prob = rf_pipe.predict_proba(sample)[0]
    mlp_prob = mlp_pipe.predict_proba(sample)[0]
    hybrid_prob = rf_weight * rf_prob + (1.0 - rf_weight) * mlp_prob
    class_names = rf_pipe.classes_
    ranked_idx = np.argsort(hybrid_prob)[::-1][:3]
    explanations = []
    # For each top class, construct a simple explanation by comparing each feature to mean
    diffs = (sample.iloc[0] - X_train.mean()).to_dict()
    for idx in ranked_idx:
        label = class_names[idx]
        confidence = hybrid_prob[idx]
        # pick features with largest positive differences
        sorted_features = sorted(diffs.items(), key=lambda x: x[1], reverse=True)
        top_feats = [feat for feat, _ in sorted_features[:3]]
        reasons = []
        for feat_name in top_feats:
            # human-readable names
            name_map = {
                'n_mg_kg': 'N',
                'p_mg_kg': 'P',
                'k_mg_kg': 'K',
                'suhu': 'Temperature',
                'kelembaban': 'Humidity',
                'ph_tanah': 'pH',
                'curah_hujan_mm': 'Rainfall'
            }
            simple = name_map.get(feat_name, feat_name)
            val = sample.iloc[0][feat_name]
            mean_val = X_train[feat_name].mean()
            direction = "high" if val >= mean_val else "low"
            reasons.append(f"{simple} {val:.2f} (mean {mean_val:.2f}), {direction}")
        explanations.append({
            'label': label,
            'confidence': confidence,
            'reasons': reasons,
        })
    return explanations

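This function returns the top-3 predicted rice varieties with a short explanation for each. It blends the RF and MLP class probabilities by weighted soft voting (rf_weight for RF, 1 - rf_weight for MLP), ranks the classes by blended probability, and builds the reasons by comparing the sample's features with the column means of the reference frame passed as X_train. The reasons are the three features with the largest raw (unscaled) excess over the mean, so they are the same for all three recommendations and tend to favor features measured in large units, such as rainfall. They describe how the sample differs from a typical one rather than why the model preferred a particular variety.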
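The ranking step can be followed with concrete numbers. In this sketch the class names and probabilities are invented for illustration; in the app they come from rf_pipe.classes_ and the two predict_proba calls.

```python
import numpy as np

# Illustrative class labels and per-model probabilities for one sample.
class_names = np.array(["variety_A", "variety_B", "variety_C", "variety_D"])
rf_prob = np.array([0.10, 0.60, 0.20, 0.10])
mlp_prob = np.array([0.30, 0.20, 0.45, 0.05])
rf_weight = 0.7

# Weighted soft voting: a convex combination of the two probability vectors.
hybrid_prob = rf_weight * rf_prob + (1.0 - rf_weight) * mlp_prob

# Indices of the three largest blended probabilities, best first.
top3_idx = np.argsort(hybrid_prob)[::-1][:3]
top3 = [(str(class_names[i]), round(float(hybrid_prob[i]), 3)) for i in top3_idx]
print(top3)
```

Because both inputs sum to 1 and the weights sum to 1, the blended vector is itself a valid probability distribution, which is why the confidence values shown to the user can be read as probabilities.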

Streamlit App Logic

The Streamlit app ties all the utility functions together to create a user-friendly interface.

uploaded_file = st.file_uploader("Upload a CSV file", type=["csv"])

if uploaded_file is not None:
    try:
        # Load CSV into DataFrame
        df = pd.read_csv(uploaded_file)
    except Exception as e:
        st.error(f"Failed to read the CSV file: {e}")
        st.stop()
    # Rename columns if necessary
    rename_map = {
        'N': 'n_mg_kg',
        'P': 'p_mg_kg',
        'K': 'k_mg_kg',
        'temperature': 'suhu',
        'humidity': 'kelembaban',
        'ph': 'ph_tanah',
        'rainfall': 'curah_hujan_mm',
        'variety': 'varietas_padi',
    }
    df = df.rename(columns=rename_map)
    required_cols = ['n_mg_kg', 'p_mg_kg', 'k_mg_kg', 'suhu', 'kelembaban', 'ph_tanah', 'curah_hujan_mm', 'varietas_padi']
    missing_cols = [c for c in required_cols if c not in df.columns]
    if missing_cols:
        st.error(f"The following columns were not found in the dataset: {missing_cols}")
        st.stop()
    st.success(f"Dataset loaded successfully: {df.shape[0]} rows, {df.shape[1]} columns")
    st.dataframe(df.head(5))
    # Allow user to set test size
    test_size = st.slider("Test size (fraction of the data held out for testing)", 0.1, 0.5, 0.2, step=0.05)
    # Trigger training
    if st.button("Train Models"):
        with st.spinner("Training models, please wait..."):
            results = cached_train_and_evaluate(df, test_size)
        st.session_state['results'] = results
        st.success("Training complete!")

    # After training, display results
    if 'results' in st.session_state:
        results = st.session_state['results']
        rf_pipe = results['rf_pipe']
        mlp_pipe = results['mlp_pipe']
        stack_clf = results['stack_clf']
        eval_res = results['eval']
        X_test = results['X_test']
        y_test = results['y_test']
        rf_proba = results['rf_proba']
        mlp_proba = results['mlp_proba']
        class_names = rf_pipe.classes_.tolist()
        # Display evaluation metrics summary
        st.subheader("Performance Summary")
        summary_df = pd.DataFrame({
            'Model': ['Random Forest', 'MLP', 'Stacking'],
            'Accuracy': [eval_res['rf']['acc'], eval_res['mlp']['acc'], eval_res['stack']['acc']],
            'Macro-F1': [eval_res['rf']['f1'], eval_res['mlp']['f1'], eval_res['stack']['f1']],
        }).sort_values('Macro-F1', ascending=False)
        st.dataframe(summary_df.style.format({'Accuracy': '{:.3f}', 'Macro-F1': '{:.3f}'}))
        # Confusion matrices
        st.subheader("Confusion Matrices")
        cols = st.columns(3)
        for idx, (key, title) in enumerate([('rf', 'Random Forest'), ('mlp', 'MLP'), ('stack', 'Stacking')]):
            cm = eval_res[key]['cm']
            fig = plot_confusion_matrix(cm, class_names, title)
            cols[idx].pyplot(fig)
        # Hybrid soft voting
        st.subheader("Hybrid (Soft Voting)")
        weight = st.slider(
            "RF weight in the hybrid (0 = MLP only, 1 = RF only)",
            0.0, 1.0, 0.5, step=0.05,
        )
        hybrid_prob = weight * rf_proba + (1.0 - weight) * mlp_proba
        hybrid_pred = rf_pipe.classes_[np.argmax(hybrid_prob, axis=1)]
        hybrid_acc = accuracy_score(y_test, hybrid_pred)
        hybrid_f1 = f1_score(y_test, hybrid_pred, average='macro')
        hybrid_cm = confusion_matrix(y_test, hybrid_pred)
        st.write(f"**Hybrid Accuracy:** {hybrid_acc:.3f}, **Macro-F1:** {hybrid_f1:.3f}")
        fig = plot_confusion_matrix(hybrid_cm, class_names, f"Hybrid (α={weight:.2f})")
        st.pyplot(fig)
        # Top-3 recommendations
        st.subheader("Top-3 Variety Recommendations (Hybrid)")
        # Select sample from test set or manual input
        selection_mode = st.radio(
            "Choose how to provide the input:",
            ("From test data", "Manual")
        )
        sample_df = None
        if selection_mode == "From test data":
            row_idx = st.number_input(
                "Select a row index from the test data (0 to {})".format(len(X_test)-1),
                min_value=0, max_value=len(X_test)-1, step=1, value=0,
            )
            sample_df = X_test.iloc[[row_idx]].copy()
            st.write("Input values:")
            st.json(sample_df.to_dict(orient='records')[0])
        else:
            # manual inputs for each numeric feature
            manual_input = {}
            st.write("Enter the feature values manually:")
            for col in ['n_mg_kg', 'p_mg_kg', 'k_mg_kg', 'suhu', 'kelembaban', 'ph_tanah', 'curah_hujan_mm']:
                default_val = float(df[col].mean())
                val = st.number_input(col, value=default_val)
                manual_input[col] = val
            sample_df = pd.DataFrame([manual_input])
            st.write("Input values:")
            st.json(manual_input)
        # Generate recommendations
        if sample_df is not None and st.button("Get Recommendations"):
            with st.spinner("Computing recommendations..."):
                # Prefer training features as the reference; fall back to the
                # test features when the cached results do not include X_train.
                reference = results.get('X_train', results['X_test'])
                explanations = recommend_top3(sample_df, rf_pipe, mlp_pipe, weight, reference)
            for rank, rec in enumerate(explanations, start=1):
                st.markdown(f"### {rank}. {rec['label']} – Conf: {rec['confidence']:.3f}")
                st.write("Reasons:")
                for reason in rec['reasons']:
                    st.write(f"- {reason}")
else:
    st.info("Upload a CSV file to get started.")

This part of the app handles file uploading, data loading, column renaming, and training initiation. It also displays the results after training, including evaluation metrics, confusion matrices, and hybrid model performance. The app allows users to input data manually or select a sample from the test set to get rice variety recommendations.
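The effect of the soft voting slider can also be explored offline by sweeping the weight over a grid. The arrays below are toy stand-ins for rf_proba, mlp_proba, and y_test (all values are invented), and the five-point grid is an arbitrary choice for this sketch.

```python
import numpy as np

# Toy stand-ins for rf_proba, mlp_proba and y_test (3 classes, 4 samples).
classes = np.array(["A", "B", "C"])
rf_proba = np.array([
    [0.80, 0.10, 0.10],
    [0.50, 0.40, 0.10],
    [0.20, 0.20, 0.60],
    [0.60, 0.30, 0.10],
])
mlp_proba = np.array([
    [0.40, 0.50, 0.10],
    [0.10, 0.80, 0.10],
    [0.10, 0.20, 0.70],
    [0.45, 0.50, 0.05],
])
y_test = np.array(["A", "B", "C", "A"])

# Evaluate hybrid accuracy for each RF weight on the grid.
results = []
for weight in np.linspace(0.0, 1.0, 5):
    hybrid = weight * rf_proba + (1.0 - weight) * mlp_proba
    pred = classes[np.argmax(hybrid, axis=1)]
    results.append((float(weight), float((pred == y_test).mean())))

# max() keeps the first weight that reaches the best accuracy.
best_weight, best_acc = max(results, key=lambda item: item[1])
print(results)
print(best_weight, best_acc)
```

In this toy data each model makes mistakes the other does not, so the blended weights beat both pure models. Keep in mind that a weight tuned on the test set gives an optimistic score; a separate validation split would give a fairer estimate.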

In summary, this Streamlit application offers a comprehensive solution for predicting rice varieties using machine learning techniques. It provides a user-friendly interface for data uploading, model training, evaluation, and recommendation generation. Have fun predicting, everyone!