[{"content":"BACKGROUND STORY Creating a portfolio website has been on my personal project to-do list for many years. But there were a lot of obstacles (mostly personal motivation) to actually making one.\nThe more legit reasons are:\nSince typical website builders (Wix, Squarespace, etc.) charge for hosting, which one is worth trying for my very early-stage build? Most of these are for eCommerce, and I don\u0026rsquo;t need many of these tools.\nIs there a pre-made theme among these websites that fits my needs?\nHow do I organize the website around what I can present/provide? I am not in the Art \u0026amp; Design field, where I would have many pictures or products to showcase.\nA MORE ADVANCED PATH (why run before walk) Some time ago, I found a portfolio website from someone in tech. The theme was not like anything I had seen in common website builders. The footer of the page said \u0026lsquo;Built with Hugo, through Netlify\u0026rsquo;. At that time, I had not coded with Python, Linux, or anything. I was not even familiar with GitHub etc. I just wanted to use that theme.\nSo, I tried to follow any articles I could find. But I could not finish testing, and I got a job that was not that technical anyway. So I thought I wouldn\u0026rsquo;t need it, and I stopped.\nNOW Fast forward: I found out that GitHub can host a simple static website through GitHub Pages. So, I guess it is time to finally check that off the to-do list\u0026hellip; except\u0026hellip;\nMORE HEADACHE\n","date":"2026-01-01T00:00:00Z","image":"https://raffwh.github.io/p/hello-world/cover_hu_9e6bcf9cfe9a9448.jpg","permalink":"https://raffwh.github.io/p/hello-world/","title":"Hello World of Website Building"},{"content":"Overview This post documents an end-to-end supervised machine learning workflow built in Python during my graduate coursework at Boston University. 
The goal was to build a reproducible pipeline that handles data preprocessing, feature selection, model training, and evaluation: the kind of workflow that translates directly to real-world analytics problems.\nProblem \u0026amp; Dataset The analysis used a structured tabular dataset with a mix of numeric and categorical features. The target variable was a binary classification outcome. The challenge was to:\nHandle missing values and skewed distributions thoughtfully\nEngineer features that improve signal without data leakage\nSelect a modeling strategy that generalizes well\nPipeline Design\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.preprocessing import StandardScaler, OneHotEncoder\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\nfrom sklearn.model_selection import StratifiedKFold, cross_val_score\nfrom sklearn.feature_selection import SelectKBest, f_classif\n\n# Numeric and categorical transformers\nnumeric_transformer = Pipeline(steps=[\n    (\u0026#39;scaler\u0026#39;, StandardScaler())\n])\ncategorical_transformer = Pipeline(steps=[\n    (\u0026#39;encoder\u0026#39;, OneHotEncoder(handle_unknown=\u0026#39;ignore\u0026#39;))\n])\n\n# Column transformer (numeric_features and categorical_features are column selections defined elsewhere)\npreprocessor = ColumnTransformer(transformers=[\n    (\u0026#39;num\u0026#39;, numeric_transformer, numeric_features),\n    (\u0026#39;cat\u0026#39;, categorical_transformer, categorical_features)\n])\n\n# Full pipeline with feature selection + model\npipeline = Pipeline(steps=[\n    (\u0026#39;preprocessor\u0026#39;, preprocessor),\n    (\u0026#39;selector\u0026#39;, SelectKBest(score_func=f_classif, k=15)),\n    (\u0026#39;classifier\u0026#39;, GradientBoostingClassifier())\n])\nCross-Validation \u0026amp; Model Comparison Models were compared using Stratified K-Fold cross-validation (k=5) to account for class imbalance and reduce 
variance in performance estimates.\nModel | CV AUC (mean ± std)\nLogistic Regression | 0.81 ± 0.03\nRandom Forest | 0.87 ± 0.02\nGradient Boosting | 0.89 ± 0.02\ncv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)\nscores = cross_val_score(pipeline, X, y, cv=cv, scoring=\u0026#39;roc_auc\u0026#39;)\nprint(f\u0026#34;AUC: {scores.mean():.3f} ± {scores.std():.3f}\u0026#34;)\nFeature Importance After fitting the final Gradient Boosting model, feature importances were extracted and visualized using matplotlib. The top features aligned well with domain knowledge, providing a sanity check on the model.\nKey Takeaways\nPipeline design matters: keeping preprocessing inside a Pipeline object prevents data leakage during cross-validation, a subtle but critical mistake to avoid.\nModel selection: Gradient Boosting outperformed simpler models, but the gain wasn\u0026rsquo;t free: it required careful regularization to avoid overfitting on the training set.\nFeature selection: Using SelectKBest inside the pipeline reduced noise features and slightly improved generalization.\nTools Used Full code available in: BU_OMDS_SU25_DX699B_RW\n","date":"2026-03-01T00:00:00Z","permalink":"https://raffwh.github.io/p/ml-pipeline-sklearn/","title":"ML Pipeline: Predicting with Scikit-Learn"},{"content":" Line tl;dr Basically, AREDS clinical trial grading data was found to be flip-flopping for our analysis purposes. We defined a criterion for our research that is unique but could cause confusion, due to the complexity of the formula and a technique that is uncommon for this dataset.\nTo address this, I created a simple visualization to help explain the flip-flop issue and the result of using our novel code. Another good thing about this visualization is that the code to make this graph can be modified easily \u0026amp; quickly if reviewers/leaders want to see a different version. 
The original visualization was created using R + ggplot2.\nBACKGROUND STORY AREDS 1 overview AREDS was a multiyear longitudinal clinical trial on age-related macular degeneration (AMD), designed to study the effect of supplements on the progression of the disease. During the first AREDS trial, the grading system used 9 steps plus progression to advanced AMD.\nMany published papers have explained the trial design and results, so I won\u0026rsquo;t go into details here.\n\u0026ndash;\nResearch Objective While many studies now use the simplified scale, we used the full scale to create a more granular analysis of the time-based progression of the disease.\nSeverity scale 1-8: the non-advanced AMD grades.\nSeverity scale 9: the advanced AMD grade.\nModified severity scale 9, extended to include 10, 11, and 12: our modification to include GA and NV as part of the continuous scale.\n\u0026ndash;\nStory Since we analyzed the data at a very granular level, we hit an issue where the count of advanced AMD differed between timepoints.\nUpon further investigation, we found many cases where an eye was graded as progressed to the advanced stage but then went back to a non-advanced stage.\nSince we wanted to analyze this progression like a survival analysis, we had to make sure that a progressed eye could not go back to a non-advanced stage.\nIn some cases, the advanced grade occurred only once in the middle of the study period, and the eye then went back to a very early stage (scale \u0026lt;6). In other cases, the eye was graded as progressing to the advanced stage multiple times. So, we developed a criterion that counts an eye as progressed only when its grade or status had passed the progression status/scale at 2 or more time points in a row.\nThen we realized the system was quite abstract.\nSo, I developed a visualization to show the progression and the flip-flopping issue. 
\u0026ndash;\nGRAPH PREVIEW Understanding the graph: The graph shows the typical grading for the subjects in the AREDS 1 trial.\nThe usual grading interval after the baseline visit was within 2 years. Different subjects had different total follow-up times. The main issue is represented by IDs 5, 6, and 7 (eye IDs, not the same as the AREDS IDs; the replacement is just for better visualization).\nThe eye went to the advanced stage on the 2nd grading (scale 11), but went back to a low grade (scale 4) on the 3rd visit. It even reached healthy status (scale 1) on the 5th visit. We decided that in these cases, the progression could be due to human error or study design error (blinded graders). These cases were the reason we developed a different criterion for defining when an eye had progressed to the advanced stage. The next few cases show how we selected the time of progression. Eye 11 is a good example of how we skipped the first progression (scale 11), as the following grades were back in the non-advanced stage. When the eye finally had advanced-stage grades back to back, we defined this as a true progression. But instead of selecting the first grade, we chose the second timepoint as the true termination point. Other version CONCLUSION A snippet of the R code is in my GitHub.\nThe code was originally developed for our internal meetings, where we wanted to view and compare individual IDs/grades when we came across issues. So it was built to let us quickly review each ID, and it has more features, such as multiple progression criteria.\nWe simplified the visualization a bit for publication.\nCITATION Seddon JM, Widjajahakim R, Rosner B. Rare and Common Genetic Variants, Smoking, and Body Mass Index: Progression and Earlier Age of Developing Advanced Age-Related Macular Degeneration. Invest Ophthalmol Vis Sci. 2020;61(14):32. 
doi:10.1167/iovs.61.14.32 ","date":"2020-12-02T00:00:00Z","image":"https://raffwh.github.io/p/areds-score-flipflop/sevscale_supFig_year_20201113_g_hu_7dfb0a3790d3db6f.webp","permalink":"https://raffwh.github.io/p/areds-score-flipflop/","title":"Visualizing abstract formula for research paper"},{"content":"THE STORY When I was working at UMass Medical School, the Ophthalmology Department was looking to update the department logo. Somehow I had design fever again and wanted to dust off my design skills. I had been involved with my lab\u0026rsquo;s website and knew many people in the department, so why not.\nDESIGN IDEA / THINKING ICON / BASE DESIGN Our lab (and many other research-side teams) has several genetic projects. I was looking into how to merge / mix / assimilate something that is perceived as genetic with something that looks like an eye.\nGENETIC A few visuals I thought of that are common in genetic studies:\nBlots (like western blot, Southern blot, northwestern blot, etc.)\nMarking of mutations in a chromosome\nExon splices\nEYE I guess there is not much variety, or no need for an out-of-the-box idea, for an eye:\nA full eye, anterior view\nSagittal view of the eyeball\nThe iris\nCOLOR I had known about Pantone long before, but the Pantone Color of the Year somehow became big news recently, and it felt like the color had become a trendy influencer topic. The color for 2021 was somehow two colors instead of one. There was some info on the meaning of each color, and I thought they could fit the department\u0026rsquo;s vision or current situation. So I decided to just use these 2 colors.\nCREATION I came up with three different styles.\nCOMMON AMONG 3 I couldn\u0026rsquo;t come up with three totally different ideas, so these 3 have very similar workflows:\nI used an Adobe Illustrator circular pattern\nFirst, I created a custom brush based on the patterns of genetic icons I came up with earlier. Then, I created a circular line vector. 
Finally, I applied the custom brush to the line instead of a solid stroke.\nThe 2021 text is the same on all three\nThe 20 uses gray to mark both the unchanging part of the year (the two thousand) and the fact that we had just passed the year 2020; marking it gray can read as switching it off. The 21 uses the yellow (Illuminating) to mark the new year, and maybe the new meaning (we were just starting to get used to COVID-19 life anyway).\nART 1 The eye part is based on the iris, while the line is more like a line chart, though I can say it is like exon splices too. There is a flare on the inner lower left, mimicking how eye icons use a flare to show transparency.\nART 2 The eye base is technically a full front eye. The original logo of UMass Eye Center used this style of eye as well (as do many other eye clinics). The line part was inspired a bit by the Southern blot, but technically it was created using R as a stacked bar graph.\nART 3 The idea is similar to Art 1, but with more flair added. The line part looks like a chromosome, even though it was created with random ellipses.\nSELECTION In the end I selected Art 2 since I like the stacked bar graph line. Art 1 is good too, but it doesn\u0026rsquo;t really look like an eye. Art 3 is just too busy.\nIN THE END The department had a form to submit the designs. I even created a report form to explain my thinking. But the decision makers in the end decided not to change their old logo at all 😓.\n","date":"2020-12-01T00:00:00Z","image":"https://raffwh.github.io/p/eye-center-logo-2021/art_2_hu_702dee104394fc83.jpg","permalink":"https://raffwh.github.io/p/eye-center-logo-2021/","title":"Eye Center Logo 2021"},{"content":"BACKGROUND STORY While data analytics work now has plenty of platforms and tools, many businesses still use Excel as their main program/software/app.\nSome of my work used Excel for a simple dashboard or even a work report. 
Sometimes I saw Excel dashboards that were quite complicated, and I saw colleagues confused about how to use them.\nIn other cases, we received different datasets as Excel files. They could be new data or just a transformed version of a larger dataset. But sometimes we received files almost every day due to the nature of the project. This caused confusion about what the data was, or worse, which revision it was (when the filename itself does not tell you).\nSOLUTION We tried to replicate how a good data science project has a README file. For Excel, we can dedicate one tab to this readme, while the other tabs hold the content of the file.\n","date":"2020-08-22T00:00:00Z","permalink":"https://raffwh.github.io/p/readme-excel/","title":"Readme for Excel file"}]