
Why R?

The first edition of the Polish R Users Conference, called Why R?, took place on 27-29 September at the Faculty of Mathematics and Information Science, Warsaw University of Technology. The Polish R community is very strong. Dig into the post to find out what topics we covered and how the event was organized, and pay attention to the gifts we prepared for the invited speakers!

Use switch() instead of ifelse() to return a NULL

Have you ever tried to return NULL from the ifelse() function? ifelse() is a simple vectorized alternative to conditional statements, but it cannot return NULL as the result of its evaluation. Check out a tricky workaround in this post.
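
As a taste of the problem (a minimal sketch of my own, not necessarily the post's exact workaround): ifelse() builds its result by vectorized replacement, so a zero-length NULL breaks it, while switch() happily evaluates to NULL.

    x <- 5
    ifelse(x > 3, NULL, "small")   # errors: the NULL replacement has length zero
    switch(as.character(x > 3),    # one possible workaround via switch()
           "TRUE"  = NULL,         # returns NULL when x > 3
           "FALSE" = "small")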

How successful can an R meetup be? meet(R) in Tricity! - RSelenium and Big Data processing

On Thursday (12.01.2017) we had a chance to attend the first TriCity R Users Group (Pomerania, Poland) meeting. The meetup was unexpectedly successful! That success can be measured by the time attendees spent on ardent comments and questions after each of the two great presentations: every 20-25 minute talk was followed by a lively 30-minute discussion. It is amazing that the questions lasted longer than the presentations. Is it thanks to the climate? The nature of the Pomeranian community? Or perhaps the excellent organization? In this post I summarize the meeting, describe the presentations and reveal the organizers' identities.

Entropy Based Image Binarization with imager and FSelectorRcpp

Image processing and computer vision have gained significant interest over the last two decades. Image analysis can be used to detect objects or people in images and videos, and it is widely used in medicine to detect cancer tissue and to improve the diagnosis of brain, lung and heart diseases. Automation has enabled the analysis of terabytes of image data, from which we improve our quality of life and draw insights for business decisions. In this post I present basic operations that can be applied to a simple image, all thanks to the imager package, by which I am truly impressed. I also present a quick entropy-based approach to image binarization, which transforms greyscale images into binarized black-and-white output.
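
For a taste of what the post covers, here is a hedged sketch using imager's bundled boats image; the entropy-based choice of the cutpoint (handled in the post with the help of FSelectorRcpp) is replaced here by imager's built-in automatic threshold, so it is a stand-in rather than the post's method.

    library(imager)
    im <- grayscale(boats)             # 'boats' is an example image shipped with imager
    plot(im)                           # greyscale input
    bw <- threshold(im, thr = "auto")  # stand-in for the entropy-based cutpoint from the post
    plot(bw)                           # binarized black-and-white output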

Determine optimal cutpoints for numerical variables in survival plots

A frequent need in biostatistical research is to group patients according to explanatory variables that are continuous. In some cases the requirement is to test the overall survival of subjects that carry a mutation in a specific gene and show high expression (over-expression) of another given gene. To visualize differences in the Kaplan-Meier estimates of survival curves between groups, the continuous variable is first discretized. The problems caused by categorization of continuous variables are well known and widely discussed (Harrell, 2015), but in this case the discretization is a deliberate simplification. In this post I present maxstat (maximally selected rank statistics) as a way to determine the optimal cutpoint for continuous variables, as provided in the survminer package by Alboukadel Kassambara.
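
For a quick feel of the API, here is a hedged sketch around survminer's surv_cutpoint() and surv_categorize(); it uses the myeloma example data and the DEPDC1 expression column that, to my knowledge, ship with the package, so treat the column choice as illustrative.

    library(survminer)
    data(myeloma, package = "survminer")
    cut <- surv_cutpoint(myeloma, time = "time", event = "event",
                         variables = "DEPDC1")  # maximally selected rank statistics
    summary(cut)                                # estimated optimal cutpoint
    myeloma_cat <- surv_categorize(cut)         # recodes DEPDC1 into "high" / "low"
    head(myeloma_cat)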

News from archivist 2.0 on eRum2016 conference

Ten days ago the eRum2016 conference (European R Users Meeting 2016) finished. It was a huge event that attracted over 250 attendees from both academia and business. The meeting was a great opportunity to listen to amazing keynote speakers such as Heather Turner, Katarzyna Stapor, Rasmus Bååth, Jakub Glinka, Ulrike Grömping, Przemyslaw Biecek, Romain Francois, Marek Gagolewski, Matthias Templ and Katarzyna Kopczewska. A big thank you goes to the whole organizing committee, and to dr Maciej Beręsewicz (head) especially! There were 10 workshops, 2 package sessions, 2 data workflow sessions, 3 methodology sessions, 1 BioR session, 2 business sessions, lightning talks, a poster session and, of course, a great welcome paRty. I could not miss the chance to present news from the latest release (ver 2.0) of our archivist package.

Rocker - explanation and motivation for Docker containers usage in applications development

What is R? I was asked this at the end of my presentation at the 10th Cracow R Users Meetup, held last Friday (30.09.2016). I felt strange, but I confidently confirmed that R is the language of Data Science and is designed for statistical data analysis. Later I found out that a few of the listeners came to the meetup to hear more about Docker than about R, as my topic was Rocker - explanation and motivation for Docker containers usage in applications development. In this post I present an overview of my presentation. If you are not familiar with using Docker in R application development, then this is a must-read for you!

What Every R Package Must (REALLY) Contain? An Example on the eRum2016 Package

R package development is a complex process of creating (mostly) useful software that will (probably) be used by other users. This means the provided tool should be robust, well tested and properly documented. Developers in many different languages have invented various approaches to improve software development, documentation and testing. R users have adopted a few of them: most of us use Travis for continuous integration, roxygen2 for documentation, devtools for testing and knitr / rmarkdown for writing manuals, tutorials, vignettes and package websites. This development toolkit means the R package structure is rather broad, especially since many of us (R developers) put source code from other languages into the package to speed up the performance of the created tools. Moreover, we build software libraries on top of other packages, which complicates the NAMESPACE of the prepared package and forces an understanding of the difference between depended-on, imported and suggested packages. Amid this whole ecosystem of development tooling and requirements for proper package structure, I was asked: what must every R package contain? You wouldn't guess how easy the answer was.
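
Without spoiling the post's answer, the quickest way to see what a freshly scaffolded package contains is to generate one; a sketch with current usethis tooling (at the time of the post, devtools::create() was the usual helper), with the eRum2016 name taken from the title purely as an example:

    # install.packages("usethis")             # if needed
    library(usethis)
    pkg <- file.path(tempdir(), "eRum2016")   # example package name from the post title
    create_package(pkg, open = FALSE)         # scaffolds a minimal package skeleton
    list.files(pkg, recursive = TRUE)         # DESCRIPTION, NAMESPACE, R/ ...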

Extending sparklyr to Compute Cost for K-means on YARN Cluster with Spark ML Library

Machine and statistical learning wizards are becoming ever more eager to perform their analyses with the Spark ML library whenever possible. It's trendy, posh and spicy, and it gives the feeling of doing state-of-the-art machine learning and keeping up with the newest computational trends. It is even more powerful when the computations can be performed on an extraordinarily large cluster - let's say 100 machines on a YARN Hadoop cluster makes you a real data cruncher! In this post I present the sparklyr package (by RStudio), the connector that will transform you from a regular R user into the supa! data scientist who can invoke Scala code to run machine learning algorithms on a YARN cluster straight from RStudio! Moreover, I show how I extended the interface to the K-means procedure, so that it is now also possible to compute the cost of that model, which can help determine the number of clusters in segmentation problems. Thinking about learning Scala? Leave it - use sparklyr!
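
To give an idea of the workflow, here is a rough sketch on my part; sparklyr's interface has changed since the post was written, and the post's own cost extension is not reproduced here (recent sparklyr releases expose a similar idea as ml_compute_cost()).

    library(sparklyr)
    library(dplyr)
    sc <- spark_connect(master = "local")            # on a cluster: master = "yarn"
    iris_tbl <- copy_to(sc, iris, overwrite = TRUE)  # dots in column names become underscores
    km <- ml_kmeans(iris_tbl, ~ Petal_Length + Petal_Width, k = 3)
    ml_compute_cost(km, iris_tbl)                    # within-cluster cost (Spark < 3.0)
    spark_disconnect(sc)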

BioC 2016 Conference Overview and Few Ways of Downloading TCGA Data

A few weeks ago I had the great pleasure of attending the BioC 2016: Where Software and Biology Connect conference at Stanford, where I learned a lot! It wouldn't have been possible without the scholarship I received from Bioconductor (the organizers), which I deeply appreciate. It was an excellent place for software developers, statisticians and biologists to exchange experiences and to explain their work to one another, as mutual understanding between collaborators in interdisciplinary teams is essential. In this post I share my thoughts and feelings about the event, along with something I learned there: the many ways of downloading The Cancer Genome Atlas (TCGA) data.
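
The post surveys several routes to the data; as one hedged illustration (the cohort, directory and release date below are made up for the example, and argument names are as I recall them from the package docs), the author's RTCGA package provides downloadTCGA():

    library(RTCGA)
    checkTCGA("Dates")                    # lists available TCGA release dates
    downloadTCGA(cancerTypes = "BRCA",    # illustrative cohort
                 destDir = tempdir(),
                 date = "2015-11-01")     # illustrative release date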

Survival plots have never been so informative

Hadley Wickham's ggplot2 version 2.0 revolution at the end of 2015 triggered many breakages in dependent R packages, which finally led to a few packages being removed from the Comprehensive R Archive Network. As it happened, the survMisc package was removed from CRAN on 27 January 2016, and the R world was left helpless in its struggle for elegant visualizations of survival analysis. Then a new tool - the survminer package, created by Alboukadel Kassambara - appeared on the R survival scene to fill the gap, visualizing Kaplan-Meier estimates of survival curves in an elegant, grammar-of-graphics-like way. This post presents the main features of the core ggsurvplot() function from the survminer package, which creates the most informative, elegant and flexible survival plots I have seen!
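
A minimal example of the kind of plot the post is about, using the lung data shipped with the survival package; the options shown are only a small subset of what ggsurvplot() accepts.

    library(survival)
    library(survminer)
    fit <- survfit(Surv(time, status) ~ sex, data = lung)
    ggsurvplot(fit, data = lung,
               pval = TRUE,        # log-rank test p-value on the plot
               conf.int = TRUE,    # confidence bands around the curves
               risk.table = TRUE)  # number-at-risk table below the plot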

R 3.3.0 is another motivation for Docker

Have you ever run into R package versioning issues, where one application required different versions of dependent packages than another? Have you ever got stuck with a project because of the wrong pre-installed software versions on the machine where your code had to run? Maybe you had quite an adventure installing R software on a new machine because you couldn't recall all the installation steps - what did I do two years ago so that RCurl works on my local machine, yet now I can't install it on my Windows virtual machine? Or perhaps installing your R project on a new machine was easy, but the admin couldn't manage the process, as they are not a regular R user? If you ever find it problematic to move your R applications to other machines, then this Docker guide post is for you!

RTCGA factory of R packages - Quick Guide

Yesterday we were delivered a new version of R - R 3.3.0 (codename Supposedly Educational). This enabled Bioconductor (yes, not all packages are distributed on CRAN) to release its new version, 3.3. It means that all packages hosted on Bioconductor that were under rapid and vivid development have been moved to stable release versions and can now be easily installed. This happens once or twice a year. With that date I finished work on the RTCGA package and released, on Bioconductor, the RTCGA Factory of R Packages. Read this quick guide to find out more about this R toolkit for biostatistics based on data from The Cancer Genome Atlas study.
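
Installing the released toolkit follows the standard Bioconductor route; the snippet below uses the installer that was current at the time of the post (today BiocManager::install() replaces biocLite()).

    source("https://bioconductor.org/biocLite.R")   # Bioconductor installer of that era
    biocLite("RTCGA")                               # the core package
    biocLite("RTCGA.clinical")                      # one of the accompanying data packages
    library(RTCGA)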

Answers to FAQ about SparkR for R users

Many people keep asking me whether I have tried SparkR: is it worth using, is it sexy, or WHAT is it at all? I felt that putting together a list of frequently asked questions (FAQ) along the lines of WHAT is this Spark/SparkR? would help many R scientists understand this Big Data buzz-tool. I have gathered information from the documentation and some code from Stack Overflow questions in preparation for the list below.
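
As a starting point, here is a minimal "hello SparkR" sketch, assuming a Spark 2.x installation under SPARK_HOME (the post itself dates from the Spark 1.x era, when sparkR.init() was the entry point).

    library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
    sparkR.session(master = "local[*]", appName = "sparkr-faq")
    df <- as.DataFrame(faithful)        # ship a local data.frame to Spark
    head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))
    sparkR.session.stop()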